Find rows in dataframe that contain words that are bigrams/trigrams

This example is for finding bigrams:
Given:
import pandas as pd
data = [['tom', 10], ['jobs', 15], ['phone', 14],['pop', 16], ['they_said', 11], ['this_example', 22],['lights', 14]]
test = pd.DataFrame(data, columns = ['Words', 'Freqeuncy'])
test
I'd like to write a query that returns only the words containing a "_", so that the resulting df looks like this:
data2 = [['they_said', 11], ['this_example', 22]]
test2 = pd.DataFrame(data2, columns = ['Words', 'Freqeuncy'])
test2
I'm wondering why something like this doesn't work: data[data['Words'] == (len > 3)]

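To filter with a function, you need to use apply (here on the test DataFrame from the question):
test[test.apply(lambda x: len(x['Words']), axis=1) > 3]
Note that this selects words longer than three characters, not words containing an underscore.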
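The expression in the question fails because `len > 3` compares the built-in function `len` itself with an integer instead of applying it to each word; in Python 3 that comparison raises a `TypeError` before pandas is involved. A minimal sketch of a length filter, assuming the `test` frame from the question (column names spelled as in the question), could use the vectorised `.str.len()` accessor rather than a row-wise `apply`:

```python
import pandas as pd

data = [['tom', 10], ['jobs', 15], ['phone', 14], ['pop', 16],
        ['they_said', 11], ['this_example', 22], ['lights', 14]]
# Column names are kept exactly as spelled in the question.
test = pd.DataFrame(data, columns=['Words', 'Freqeuncy'])

# Build a boolean mask from the length of each word, then index with it.
long_words = test[test['Words'].str.len() > 3]
print(long_words['Words'].tolist())
```

This keeps every word with more than three characters, so it is a length filter and not yet an underscore filter.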

The pandas way of doing it is like this:
import pandas as pd
data = [['tom', 10], ['jobs', 15], ['phone', 14],['pop', 16], ['they_said', 11], ['this_example', 22],['lights', 14]]
test = pd.DataFrame(data, columns = ['Words', 'Freqeuncy'])
test = test[test.Words.str.contains('_')]
test
To do the opposite, apply the negated mask to the original, unfiltered DataFrame:
test = test[~test.Words.str.contains('_')]
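If bigrams and trigrams need to be told apart, one option is to count the underscores in each word. The sketch below assumes that tokens are joined by a single `_`, so one underscore marks a bigram and two mark a trigram; the word `'new_york_city'` is a made-up row added purely for illustration.

```python
import pandas as pd

# 'new_york_city' is a hypothetical trigram added for illustration.
words = pd.DataFrame({'Words': ['tom', 'they_said', 'this_example', 'new_york_city']})

# Number of underscores per word: 0 = single word, 1 = bigram, 2 = trigram.
underscores = words['Words'].str.count('_')

bigrams = words[underscores == 1]
trigrams = words[underscores == 2]
print(bigrams['Words'].tolist())
print(trigrams['Words'].tolist())
```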

Related

How to break out of the current loop and go to the next one when a condition is met in Python?

I have a data frame like the one below. I want to iterate through the unique values of the Name column and collect the values of the Age column where Age is 10. Once the condition is met, the inner loop should break and the outer loop should continue with the next name. I tried to do this with a while loop, but it is not working. What is the best way to break out of the current loop once the condition is met and move on to the next iteration?
Data Frame:-
import pandas as pd
data = [['tom', 10], ['nick', 5], ['juli', 4],
['tom', 11], ['nick', 7], ['juli', 24],
['tom', 12], ['nick', 10], ['juli', 15],
['tom', 14], ['nick', 20], ['juli', 17]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
Loop:-
for j in df['Name'].unique():
    print(j)
    o=0
    t=[]
    while o == 10:
        for k in df['Age']:
            if k == 10:
                t.append(k)
                o = k
output:-
tom
nick
juli
It is printing the values in column Name, but nothing inside the while loop is printed. How do I achieve this?

Do you mean something like this?
# Data Frame:-
import pandas as pd
data = [['tom', 10], ['nick', 5], ['juli', 4],
['tom', 11], ['nick', 7], ['juli', 24],
['tom', 12], ['nick', 10], ['juli', 15],
['tom', 14], ['nick', 20], ['juli', 17]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
# Loop:-
for j in df['Name'].unique():
    print("Name:", j)
    for i in df[df['Name']==j]['Age']:
        print("Age:", i)
        if i == 10:
            print("Found age 10 for", j)
            break

You can do something like this:
for name in df['Name'].unique():
    matching_ages = []
    # loop through the ages that belong to this name only
    for age in df.loc[df['Name'] == name, 'Age']:
        if age == 10:
            matching_ages.append(age)
            break
    # print the output
    print(name, matching_ages)
This will print output like this:
tom [10]
nick [10]
juli []
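The explicit loops can also be avoided altogether. As a sketch, assuming the `df` from the question, filtering on the condition first and then keeping the first matching row per name gives the same information without `break`:

```python
import pandas as pd

data = [['tom', 10], ['nick', 5], ['juli', 4],
        ['tom', 11], ['nick', 7], ['juli', 24],
        ['tom', 12], ['nick', 10], ['juli', 15],
        ['tom', 14], ['nick', 20], ['juli', 17]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

# Keep only rows where Age is 10, then keep the first such row per name.
first_match = df[df['Age'] == 10].drop_duplicates('Name')
names_with_age_10 = first_match['Name'].tolist()
print(names_with_age_10)
```

Names that never reach age 10 (here `juli`) simply do not appear in the result.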

What is wrong here in colouring the Excel sheet?

Here I need to colour rows with Age<13 'red' and rows with Age>=13 'green'. But the final 'Report.xlsx' isn't getting coloured. What is wrong here?
import pandas as pd
data = [['tom', 10], ['nick', 12], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
df_styled = df.style.applymap(lambda x: 'background:red' if x < 13 else 'background:green', subset=['Age'])
df_styled.to_excel('Report.xlsx',engine='openpyxl',index=False)
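One likely cause, offered as an assumption rather than a confirmed diagnosis: the Excel writer translates only a subset of CSS, and it reads the `background-color` property rather than the `background` shorthand. A minimal sketch under that assumption follows; `age_to_css` and `write_report` are hypothetical helper names, and the export itself needs `openpyxl` and `jinja2` installed.

```python
import pandas as pd

def age_to_css(age):
    # Return a CSS declaration using the full 'background-color' property.
    return 'background-color: red' if age < 13 else 'background-color: green'

def write_report(df, path='Report.xlsx'):
    styler = df.style
    # Newer pandas offers Styler.map; older versions only have Styler.applymap.
    apply_elementwise = getattr(styler, 'map', None) or styler.applymap
    styled = apply_elementwise(age_to_css, subset=['Age'])
    styled.to_excel(path, engine='openpyxl', index=False)

data = [['tom', 10], ['nick', 12], ['juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print([age_to_css(age) for age in df['Age']])
```

Calling `write_report(df)` would then write the styled sheet to `Report.xlsx`.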

Losing column names when converting back to dataframe from list

I have created a dataframe, and I need to do two operations:
Convert it to a list
Convert the same list back to a dataframe with the original column names.
Issue: I am losing the column names when I first convert to a list, and when I convert back to a dataframe I do not get those column names.
Please help!
import pandas as pd
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
#convert df to list
a=df.values.tolist()
#convert back to original dataframe
df1 = pd.DataFrame(a)
print(df1)
Current output
I am unable to get the column names.

You need to pass the column names with df.columns; if the index is not the default one, pass it too:
df1 = pd.DataFrame(a, columns=df.columns, index=df.index)
If the original DataFrame has the default RangeIndex:
df1 = pd.DataFrame(a, columns=df.columns)
EDIT:
If you need a similar structure, use DataFrame.to_dict with orient='split', which converts the DataFrame to a dictionary of column names, index and data, like:
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
d = df.to_dict(orient='split')
print (d)
{'index': [0, 1, 2],
'columns': ['Name', 'Age'],
'data': [['tom', 10], ['nick', 15], ['juli', 14]]}
And to restore the original DataFrame use:
df2 = pd.DataFrame(d['data'], index=d['index'], columns=d['columns'])
print(df2)
   Name  Age
0   tom   10
1  nick   15
2  juli   14
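Putting the first suggestion together, a minimal round-trip sketch (assuming the same three-row frame) keeps the column labels in a separate variable, since `values.tolist()` carries only the cell values:

```python
import pandas as pd

data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

# The nested list holds cell values only, so save the labels alongside it.
rows = df.values.tolist()
columns = df.columns.tolist()

# Rebuild the frame, restoring the original column names.
df1 = pd.DataFrame(rows, columns=columns)
print(df1.columns.tolist())
```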

Pandas get cell value by row NUMBER (NOT row index) and column NAME

import pandas as pd
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'], index = [7,3,9])
display(df)
df.iat[0,0]
I'd like to return the Age in the first row (basically something like df.iat[0,'Age']). Expected result = 10
Thanks for your help!

df['Age'].iloc[0] works too, similar to what Chris had answered.

Use iloc and Index.get_loc:
df.iloc[0, df.columns.get_loc("Age")]
Output:
10
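To make the distinction concrete, the sketch below (assuming the frame from the question, with index labels 7, 3, 9) shows that positional access ignores the index labels, while label-based access with `loc` needs the label 7 to reach the same row:

```python
import pandas as pd

data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'], index=[7, 3, 9])

# Translate the column name to its integer position, then index by position.
age_pos = df.columns.get_loc('Age')
by_position = df.iat[0, age_pos]

# Equivalent: select the column by name, then take the first row by position.
by_column_then_row = df['Age'].iloc[0]

# Label-based access uses the index value 7, not the row number 0.
by_label = df.loc[7, 'Age']
print(by_position, by_column_then_row, by_label)
```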

Drop all data in a pandas dataframe

I would like to drop all data in a pandas dataframe, but I am getting TypeError: drop() takes at least 2 arguments (3 given). I essentially want a blank dataframe with just my column headers.
import pandas as pd
web_stats = {'Day': [1, 2, 3, 4, 2, 6],
'Visitors': [43, 43, 34, 23, 43, 23],
'Bounce_Rate': [3, 2, 4, 3, 5, 5]}
df = pd.DataFrame(web_stats)
df.drop(axis=0, inplace=True)
print(df)

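You need to pass the labels to be dropped.
df.drop(df.index, inplace=True)
By default, it operates on axis=0.
You can achieve the same result with
df = df.iloc[0:0]
which is much more efficient. Note that the slice returns a new, empty DataFrame rather than modifying df in place, hence the reassignment.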

My favorite:
df = df.iloc[0:0]
But be aware that df.index.max() will then be nan.
To add items I use (this needs import math):
df.loc[0 if math.isnan(df.index.max()) else df.index.max() + 1] = data

My favorite way is:
df = df[0:0]

Overwrite the dataframe with something like this:
import pandas as pd
df = pd.DataFrame(None)
or, if you want to keep the columns in place:
df = pd.DataFrame(columns=df.columns)
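As a quick check of how these options differ, the sketch below (assuming the `web_stats` frame from the question) compares slicing zero rows with rebuilding from the column names. Both keep the headers, but only slicing preserves the column dtypes; the rebuilt frame starts with `object` columns.

```python
import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 2, 6],
             'Visitors': [43, 43, 34, 23, 43, 23],
             'Bounce_Rate': [3, 2, 4, 3, 5, 5]}
df = pd.DataFrame(web_stats)

# Slice zero rows: headers and dtypes are kept.
sliced = df.iloc[0:0]

# Rebuild from the column labels: headers are kept, dtypes become object.
rebuilt = pd.DataFrame(columns=df.columns)

print(len(sliced), list(sliced.columns), sliced.dtypes.tolist())
print(len(rebuilt), list(rebuilt.columns), rebuilt.dtypes.tolist())
```

Which variant fits better depends on whether later appends should keep the original integer dtypes.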

If your goal is to drop every column of the dataframe, you need to pass all of the column names. For me, the best way is to pass a list comprehension to the columns kwarg, which works regardless of which columns the df has. Note that this removes the column headers as well, leaving only the index.
import pandas as pd
web_stats = {'Day': [1, 2, 3, 4, 2, 6],
'Visitors': [43, 43, 34, 23, 43, 23],
'Bounce_Rate': [3, 2, 4, 3, 5, 5]}
df = pd.DataFrame(web_stats)
df = df.drop(columns=[i for i in df.columns])

This code creates a clean, empty dataframe:
df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
#clean
df = pd.DataFrame()
