I am trying to delete an icon that appears in many rows of my CSV file. When I create a DataFrame using pd.read_csv it shows a green squared check icon, but if I open the CSV in Excel I see ✅ instead. I tried to delete it using the split function, because the verification status is separated from the comment by |:
df['reviews'] = df['reviews'].apply(lambda x: x.split('|')[1])
I noticed it doesn't detect the "|" separator when the review contains the icon mentioned above.
I am not sure if it is an encoding problem. I tried adding encoding='utf-8' to pandas read_csv, but it didn't solve the problem.
Thanks in advance.
I would like to add that this is what the file looks like when I open the CSV using Excel.
You can remove non-latin characters using encode/decode methods:
>>> df
           reviews
0  ✓ Trip Verified
1         Verified
>>> df['reviews'].str.encode('latin1', errors='ignore').str.decode('latin1')
0     Trip Verified
1          Verified
Name: reviews, dtype: object
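For a self-contained version of the same trick (with made-up data), note that encoding to latin-1 with errors='ignore' silently drops every character outside latin-1, not just this one emoji, so a trailing strip() is added here to remove the leftover leading space:

```python
import pandas as pd

df = pd.DataFrame({'reviews': ['\u2705 Trip Verified', 'Not Verified']})

# Round-tripping through latin-1 drops any character outside that
# encoding (the check-mark emoji included); strip() removes the
# leftover leading space.
cleaned = (df['reviews']
           .str.encode('latin1', errors='ignore')
           .str.decode('latin1')
           .str.strip())
print(cleaned.tolist())  # ['Trip Verified', 'Not Verified']
```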
Say you had the following dataframe:
           reviews
0  ✅ Trip Verified
1     Not Verified
2     Not Verified
3  ✅ Trip Verified
You can use the replace method to remove the ✅ symbol, which is Unicode character U+2705.
df['reviews'] = df['reviews'].apply(lambda x: x.replace('\u2705',''))
Here is the full example:
Code:
import pandas as pd
df = pd.DataFrame({"reviews":['\u2705 Trip Verified', 'Not Verified', 'Not Verified', '\u2705 Trip Verified']})
df['reviews'] = df['reviews'].apply(lambda x: x.replace('\u2705',''))
print(df)
Output:
          reviews
0   Trip Verified
1    Not Verified
2    Not Verified
3   Trip Verified
I am fairly new to Python and have tried a few different things, but I keep getting small syntax errors or just nothing being printed back.
I have an Excel sheet that has 4 columns and 40 rows.
Snippet of my Excel sheet.
Name  Age  Pet(s)  Married
john  30   1       yes
mary  25   2       no
...
I want to create a conditional statement that says
if ((df['age'] > 20) & (df['pet'] > 2))
print that line.
I want to repeat the process until the code has reached the end of the data set. Below is my attempt at creating the conditional; however, it is not printing anything. I am not having any issues opening and reading the Excel file, just with the conditional statement.
FilePath = "location of file"
FileName = "Name of file"
df = pd.read_excel(FilePath + '//' + FileName + '.xlsx'
...
For i in range( 0, len(df['age'])
if ((df['age'] > 20) & (df['pet'] > 2))
print line
Does this fix your problem?
for i in range(len(df)):
    if (df.loc[i, 'age'] > 20) and (df.loc[i, 'pet'] > 2):
        print(df.loc[i])
You want to specify a certain row in age and pet columns using the .loc method. Also, use range(len(df)) to iterate through the number of rows in the entire dataframe instead of len(df['age']).
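A loop is not strictly required here, though. As an alternative sketch (with made-up rows standing in for the Excel data), pandas boolean indexing applies the condition to every row at once:

```python
import pandas as pd

# Hypothetical stand-in for the Excel data
df = pd.DataFrame({'age': [30, 25, 45], 'pet': [1, 2, 3]})

# Boolean indexing: & combines the two conditions element-wise,
# and the resulting mask selects all matching rows in one step
matches = df[(df['age'] > 20) & (df['pet'] > 2)]
print(matches)
```

This avoids the index bookkeeping entirely and is the idiomatic pandas way to filter rows.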
I am trying to accept a variable input of many search terms separated by commas via an HTML form (#search) and query 2 columns of a dataframe.
Each column query works on its own but I cannot get them to work together in a and/or way.
First column query:
filtered = df.query('`Drug Name` in @search')
Second column query:
filtered = df.query('BP.str.contains(@search, na=False)', engine='python')
edit
combining like this:
filtered = df.query("('`Drug Name` in @search') and ('BP.str.contains(@search, na=False)', engine='python')")
Gives the following error, highlighting the python identifier in the engine argument
SyntaxError: Python keyword not valid identifier in numexpr query
edit 2
The dataframe is read from an excel file, with columns:
Drug Name (containing a single drug name), BP, U&E (with long descriptive text entries)
The search terms will be input via html form:
search = request.values.get('searchinput').replace(" ","").split(',')
as a list of drugs which a patient may be on, sometimes with the addition of specific conditions relating to medication use. Sample user input:
Captopril, Paracetamol, kidney disease, chronic
I want the list to be checked against specific drug names and also to check other columns such as BP and U&E for any mention of the search terms.
edit 3
Apologies, but trying to implement the answers given is giving me stacks of errors. What I have below gives me 90% of what I'm after, letting me search both columns including the whole contents of 'BP'. But I can only search a single term via the terminal; if I comment out and swap the lines which collect the user input (taking it from the HTML form as opposed to the terminal) I get:
TypeError: unhashable type: 'list'
@app.route('/', methods=("POST", "GET"))
def html_table():
    searchterms = []
    # searchterms = request.values.get('searchinput').replace(" ","").split(',')
    searchterms = input("Enter drug...")
    filtered = df.query('`Drug Name` in @searchterms | BP.str.contains(@searchterms, na=False)', engine='python')
    return render_template('drugsafety.html', tables=[filtered.to_html(classes='data')], titles=['na', 'Drug List'])
<form action="" method="post">
<p><label for="search">Search</label>
<input type="text" name="searchinput"></p>
<p><input type="submit"></p>
</form>
Sample data
The contents of the BP column can be quite long, descriptive and variable but an example is:
Every 12 months – Patients with CKD every 3 to 6 months.
Drug Name        BP                           U&E
Perindopril      Every 12 months              Not needed
Alendronic Acid  Not needed                   Every 12 months
Allopurinol      Whilst titrating - 3 months  Not needed
With this line:
searchterms = request.values.get('searchinput')
Entering 'months' into the html form outputs:
1 Perindopril Every 12 months Not needed
14 Allopurinol Whilst titrating – 3 months Not needed
All good.
Entering 'Alendronic Acid' into the html form outputs:
13 Alendronic Acid Not needed Every 12 months
Also good, but entering 'Perindopril, Allopurinol' returns nothing.
If I change the line to:
searchterms = request.values.get('searchinput').replace(" ","").split(',')
I get TypeError: unhashable type: 'list' when the page reloads.
However - If I then change:
filtered = df.query('`Drug Name` in @searchterms | BP.str.contains(@searchterms, na=False)', engine='python')
to:
filtered = df.query('`Drug Name` in @searchterms')
Then the unhashable type error goes and entering 'Perindopril, Allopurinol'
returns:
1 Perindopril Every 12 months Not needed
14 Allopurinol Whilst titrating – Every 3 months Not needed
But I'm now no longer searching the BP column for the searchterms.
I thought that maybe it's because searchterms is a list ('[]'), so I changed it to a tuple ('()'). That didn't change anything.
Any help is much appreciated.
I am assuming you want to query 2 columns and want to return the row if any of the query matches.
In this line, the issue is that engine='python' is inside the query string:
filtered = df.query("('`Drug Name` in @search') and ('BP.str.contains(@search, na=False)', engine='python')")
It should be:
df.query("BP.str.contains(@search, na=False)", engine='python')
If you do searchterms = request.values.get('searchinput').replace(" ","").split(','), it converts your string to a list of words, which causes the "unhashable type: 'list'" error because str.contains expects a string as input.
What you can do is use a regex to search for the terms in the list; it will look something like this:
df.query("BP.str.contains('|'.join(@search), na=False, regex=True)", engine='python')
This searches for all the individual words using a regex: '|'.join(@search) produces "searchterm_1|searchterm_2|...", and "|" means "or" in a regex, so it looks for searchterm_1 or searchterm_2 in the BP column value.
To combine the outputs of both queries, you can run them separately and concatenate the results:
pd.concat([df.query("`Drug Name` in @search", engine='python'), df.query("BP.str.contains('|'.join(@search), na=False, regex=True)", engine='python')])
Also, any string-based matching requires your strings to match exactly, including case, so you may want to lowercase everything in both the dataframe and the query. Similarly for space-separated words: the .replace(" ","") call removes spaces, so applying searchterms = request.values.get('searchinput').replace(" ","").split(',') to "Every 12 months" turns it into "Every12months". You may want to drop the .replace() part and just use searchterms = request.values.get('searchinput').split(',').
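Putting those pieces together, here is a minimal runnable sketch with made-up rows; the regex pattern is built outside the query string for clarity:

```python
import pandas as pd

df = pd.DataFrame({
    'Drug Name': ['Perindopril', 'Alendronic Acid', 'Allopurinol'],
    'BP': ['Every 12 months', 'Not needed', 'Whilst titrating - 3 months'],
})
search = ['Perindopril', 'months']

# Exact matches on the Drug Name column
by_name = df.query('`Drug Name` in @search')

# Regex "term1|term2" matches any of the search terms in BP
pattern = '|'.join(search)
by_bp = df.query('BP.str.contains(@pattern, na=False)', engine='python')

# Union of both result sets, duplicate rows removed
combined = pd.concat([by_name, by_bp]).drop_duplicates()
print(combined)
```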
Use sets. You can change the text columns to sets and check for intersection with the input. The rest is pure pandas. I never use .query because it is slow.
# change your search from list to set
search = set(request.values.get('searchinput').replace(" ","").split(','))
filtered = df.loc[(df['Drug Name'].str.split().map(lambda x: set(x).intersection(search)))
                  & (df['BP'].str.split().map(lambda x: set(x).intersection(search)))]
print(filtered)
Demo:
import pandas as pd

search = set(["apple", "banana", "orange"])
df = pd.DataFrame({
    "Drug Name": ["I am eating an apple", "We are banana", "nothing is here"],
    "BP": ["apple is good", "nothing is here", "nothing is there"],
    "Just": [1, 2, 3]
})
filtered = df.loc[(df['Drug Name'].str.split().map(lambda x: set(x).intersection(search)))
                  & (df['BP'].str.split().map(lambda x: set(x).intersection(search)))]
print(filtered)
#               Drug Name             BP  Just
# 0  I am eating an apple  apple is good     1
Updated:
I would want the results to also show We are banana, nothing is here and 2
That requires "or", which is pandas' |, instead of "and", which is pandas' &.
filtered = df.loc[(df['Drug Name'].str.split().map(lambda x: set(x).intersection(search)))
                  | (df['BP'].str.split().map(lambda x: set(x).intersection(search)))]
print(filtered)
#               Drug Name               BP  Just
# 0  I am eating an apple    apple is good     1
# 1         We are banana  nothing is here     2
If you want to search for text in all columns, you can first join all columns, and then check for search terms in each row using str.contains and the regular expression pattern that matches at least one of the terms (term1|term2|...|termN). I've also added flags=re.IGNORECASE to make the search case-insensitive:
import re
import pandas as pd

# search function
def search(searchterms):
    return df.loc[df.apply(' '.join, axis=1)       # join text in all columns
                    .str.contains(                 # check if it contains
                        '|'.join([                 # regex pattern
                            x.strip()              # strip spaces
                            for x in searchterms.split(',')  # split by ','
                        ]), flags=re.IGNORECASE)]  # case-insensitive

# test search terms
for s in ['Alendronic Acid', 'months', 'Perindopril, Allopurinol']:
    print(f'Search terms: "{s}"')
    print(search(s))
    print('-'*70)
Output:
Search terms: "Alendronic Acid"
         Drug Name          BP              U&E
1  Alendronic Acid  Not needed  Every 12 months
----------------------------------------------------------------------
Search terms: "months"
         Drug Name                           BP              U&E
0      Perindopril              Every 12 months       Not needed
1  Alendronic Acid                   Not needed  Every 12 months
2      Allopurinol  Whilst titrating - 3 months       Not needed
----------------------------------------------------------------------
Search terms: "Perindopril, Allopurinol"
     Drug Name                           BP         U&E
0  Perindopril              Every 12 months  Not needed
2  Allopurinol  Whilst titrating - 3 months  Not needed
----------------------------------------------------------------------
P.S. If you want to limit search to specific columns, here's a version that does that (with the default of searching all columns for convenience):
# search function
def search(searchterms, cols=None):
    # search columns (if None, searches in all columns)
    if cols is None:
        cols = df.columns
    return df.loc[df[cols].apply(' '.join, axis=1)  # join text in cols
                    .str.contains(                  # check if it contains
                        '|'.join([                  # regex pattern
                            x.strip()               # remove spaces
                            for x in searchterms.split(',')  # split by ','
                        ]), flags=re.IGNORECASE)]   # case-insensitive
Now if I search for months only in Drug Name and BP, it will not return Alendronic Acid where months is only found in U&E:
search('months', ['Drug Name', 'BP'])
Output:
     Drug Name                           BP         U&E
0  Perindopril              Every 12 months  Not needed
2  Allopurinol  Whilst titrating - 3 months  Not needed
Without sample input data, I used a randomly generated dataset as a showcase:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Drug_Name': ['Drug1','Drug2','Drug3','Drug2','Drug5','Drug3']*4,
                   'Inv_Type': ['X', 'Y']*12,
                   'Quant': np.random.randint(2, 20, size=24)})

# Search 1
search = "Drug3"
df.query('Drug_Name == @search')

# Search 2
search2 = "Y"
df.query('Inv_Type.str.contains(@search2, na=False)', engine='python')

# Combined (use boolean operators such as & or | instead of "and" or "or")
df.query('Drug_Name == @search & Inv_Type.str.contains(@search2, na=False)')
Please note that engine='python' should be avoided as stated in the documentation:
Likewise, you can pass engine='python' to evaluate an expression using
Python itself as a backend. This is not recommended as it is
inefficient compared to using numexpr as the engine.
That said, if you are hell-bent on using it, you can do it like this:
mask = df["Inv_Type"].str.contains(search2, na=False)
df.query('Drug_Name == @search & @mask')
Alternatively, you can achieve the same without using .query() at all:
df[(df['Drug_Name']==search) & df['Inv_Type'].str.contains(search2, na=False)]
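As a deterministic check of that last, query-free form (a tiny made-up dataset here, instead of the random one above):

```python
import pandas as pd

df = pd.DataFrame({'Drug_Name': ['Drug1', 'Drug3', 'Drug3'],
                   'Inv_Type': ['X', 'Y', 'X']})
search, search2 = 'Drug3', 'Y'

# Plain boolean indexing: no query() and no engine argument needed
result = df[(df['Drug_Name'] == search) & df['Inv_Type'].str.contains(search2, na=False)]
print(result)
```

Only the row matching both conditions (Drug3 with Inv_Type Y) survives the mask.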
Here is the question:
Write a program that computes the average learning coverage (the second column, labeled LC) and the highest Unique learner (the third column, labeled UL).
Both should be computed only for the period from June 2018 through May 2019.
Save the results in the variables mean_LC and max_UL.
The content of the .txt file is as below:
Date,LC,UL
1-01-2018,20045,687
1-02-2018,4536,67
1-03-2018,6783,209
1-04-2018,3465,2896
1-05-2018,456,27
1-06-2018,3458,986
1-07-2018,6895,678
1-08-2018,5678,345
1-09-2018,4576,654
1-10-2018,456,98
1-11-2018,456,8
1-12-2018,456,789
1-01-2019,876,98
1-02-2019,3468,924
1-03-2019,46758,973
1-04-2019,678,345
1-05-2019,345,90
1-06-2019,34,42
1-07-2019,35,929
1-08-2019,243,931
# Importing the pandas package.
import pandas as pd

# Reading the CSV-formatted file using read_csv.
df = pd.read_csv('content.txt')

# Parse the Date column so the range filter compares real dates, not
# strings (comparing the raw strings would wrongly drop Oct-Dec 2018).
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')

# Keep only the period from June 2018 through May 2019.
df = df[(df['Date'] >= '2018-06-01') & (df['Date'] <= '2019-05-01')]

# Average learning coverage.
mean_LC = df['LC'].mean()

# Highest unique learners.
max_UL = df['UL'].max()
This link will give you an idea of how the code is actually working : https://www.learnpython.org/en/Pandas_Basics
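As a self-contained check (no external file needed), the same computation can be run against an inline copy of the data; io.StringIO stands in for content.txt, and the dates are parsed so the range filter compares real dates rather than strings:

```python
import io
import pandas as pd

data = """Date,LC,UL
1-01-2018,20045,687
1-02-2018,4536,67
1-03-2018,6783,209
1-04-2018,3465,2896
1-05-2018,456,27
1-06-2018,3458,986
1-07-2018,6895,678
1-08-2018,5678,345
1-09-2018,4576,654
1-10-2018,456,98
1-11-2018,456,8
1-12-2018,456,789
1-01-2019,876,98
1-02-2019,3468,924
1-03-2019,46758,973
1-04-2019,678,345
1-05-2019,345,90
1-06-2019,34,42
1-07-2019,35,929
1-08-2019,243,931"""

# StringIO lets read_csv consume the string as if it were a file
df = pd.read_csv(io.StringIO(data))
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
df = df[(df['Date'] >= '2018-06-01') & (df['Date'] <= '2019-05-01')]

mean_LC = df['LC'].mean()
max_UL = df['UL'].max()
print(mean_LC, max_UL)  # 6175.0 986
```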
Cracked it !!
with open("LearningData.txt", "r") as fileref:
    lines = fileref.read().splitlines()

UL_list = []
total = 0  # avoid shadowing the built-in sum
# Rows 6-17 cover June 2018 through May 2019 (row 0 is the header).
for line in lines[6:18]:
    total += float(line.split(",")[1])
    UL_list.append(int(line.split(",")[2]))  # convert so comparisons are numeric

# Find the maximum UL by comparing numbers, not strings.
max_UL = UL_list[0]
for i in UL_list:
    if i > max_UL:
        max_UL = i

mean_LC = total / 12
print(mean_LC)
print(max_UL)
I'm asking for help with the Python command df = pd.read_csv('olympics.csv'). My intention is to use pandas to read this file and determine how many countries have won more than 1 Gold medal.
Assumption: 'olympics.csv' resides in the same directory as the .py file. I tried using the entire path inside the parentheses ('/Users/myname/temp/intro_ds/week2/olympics.csv'), but that had no effect.
The error I receive when running this file in Bash is: KeyError:'Gold'
I'm using Python 2.7.10 on a MacBook, Unix
CODE:
import pandas as pd
df = pd.read_csv('olympics.csv')
only_gold = df.where(df['Gold'] > 0)
print only_gold()
olympics.csv has no column named Gold, Silver or Bronze when you first convert it to CSV. You have to rename the column headers, skip some unnecessary rows, and make an index.
To read olympics.csv, skip rows (if you need to, depending on your CSV formatting) and make an index on the team names:
import pandas as pd
df = pd.read_csv('olympics.csv', skiprows=1, index_col=0)
df.head()
This should give you results with 01!, 02! instead of Gold, Silver in the column headers.
To rename the column headers from 01!, 02! and 03! to Gold, Silver and Bronze, run the following:
for col in df.columns:
    if col[:2] == '01':
        df.rename(columns={col: 'Gold' + col[4:]}, inplace=True)
    if col[:2] == '02':
        df.rename(columns={col: 'Silver' + col[4:]}, inplace=True)
    if col[:2] == '03':
        df.rename(columns={col: 'Bronze' + col[4:]}, inplace=True)
    if col[:1] == '№':
        df.rename(columns={col: '#' + col[1:]}, inplace=True)
df.head()
Now you can make query like
df['Gold']    # for summer olympics Gold medals
df['Gold.1']  # for winter olympics Gold medals
df['Gold.2']  # for combined summer+winter Gold medals
Convert the All-time_Olympic_Games_medal_table table to CSV.
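With the columns renamed, the original goal (counting countries with more than one Gold medal) is a one-liner. A minimal sketch with made-up numbers; note that unlike df.where, which keeps non-matching rows as NaN, boolean indexing drops them entirely:

```python
import pandas as pd

# Hypothetical stand-in for the renamed olympics data
df = pd.DataFrame({'Gold': [10, 1, 0, 3]},
                  index=['USA', 'Hungary', 'Iceland', 'Kenya'])

# Boolean mask keeps only rows with more than one Gold; len counts them
more_than_one_gold = df[df['Gold'] > 1]
print(len(more_than_one_gold))  # 2
```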