I need to check whether the string in each row ends with | or not.
Student,"Details"
Joe|"December 2017|maths"
Bob|"April 2018|History|Biology|Physics|"
sam|"December 2018|physics"
I have tried the code below, but it is not working as expected.
def Pipe_in_variant(path):
    df = pd.read_csv(path, sep='|')
    mask = (df['Details'])
    result = mask.endswith("|")
    print("...................")
    print(result)
Your example input is unclear; however, assuming you want to check whether items in a column end with something, use str.endswith.
Example:
df = pd.DataFrame({'Details': ['ab|c', 'acb|']})
df['Details'].str.endswith('|')
output:
0 False
1 True
Name: Details, dtype: bool
printing the matching rows:
df[df['Details'].str.endswith('|')]
output:
Details
1 acb|
Note: I think the first row of the input should be Student|"Details" instead of Student,"Details".
Here is what you can do
import pandas as pd
dframe = pd.read_csv('input.txt', sep='|')
dframe['ends_with_vbar'] = dframe['Details'].str.endswith('|')
dframe
Output:
Student Details ends_with_vbar
0 Joe December 2017|maths False
1 Bob April 2018|History|Biology|Physics| True
2 sam December 2018|physics False
Then you can print the marked row as follows
for _, row in dframe[dframe['ends_with_vbar']].iterrows():
print(f'{row["Student"]} - {row["Details"]}')
Output:
Bob - April 2018|History|Biology|Physics|
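Putting the two answers together, a minimal sketch of how the original Pipe_in_variant function from the question could look (name kept from the question, untested against the real file):
import pandas as pd

def Pipe_in_variant(path):
    # '|' is the field separator; pipes inside the quoted Details values are preserved
    df = pd.read_csv(path, sep='|')
    # str.endswith operates on the whole column and returns a boolean Series
    mask = df['Details'].str.endswith('|')
    # keep only the rows whose Details value ends with '|'
    return df[mask]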
I would like to know how to write an expression that would identify/display records of string/object data type in a Pandas DataFrame that contain leading or trailing spaces.
The purpose for this is to get an audit on a Jupyter notebook of such records before applying any strip functions.
The goal is for the script to identify these records automatically without having to type the names of the columns manually. The scope should be any column of str/object data type that contains a value with either a leading or a trailing space, or both.
Please note: I would like to see the resulting output in a DataFrame format.
Thank you!
Link to sample dataframe data
You can use:
df['col'].str.startswith(' ')
df['col'].str.endswith(' ')
or with a regex:
df['col'].str.match(r'\s+')
df['col'].str.contains(r'\s+$')
Example:
df = pd.DataFrame({'col': [' abc', 'def', 'ghi ', ' jkl ']})
df['start'] = df['col'].str.startswith(' ')
df['end'] = df['col'].str.endswith(' ')
df['either'] = df['start'] | df['end']
col start end either
0 abc True False True
1 def False False False
2 ghi False True True
3 jkl True True True
However, this is likely not faster than directly stripping the spaces:
df['col'] = df['col'].str.strip()
col
0 abc
1 def
2 ghi
3 jkl
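If you do decide to strip rather than just audit, a small sketch applying it to every str/object column at once (assuming the same df as above):
# select only the object-dtype columns and strip leading/trailing whitespace in one pass
obj_cols = df.select_dtypes(include='object').columns
df[obj_cols] = df[obj_cols].apply(lambda c: c.str.strip())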
Updated answer
To detect the columns with leading/trailing spaces, you can use:
cols = df.astype(str).apply(lambda c: c.str.contains(r'^\s+|\s+$')).any()
cols[cols].index
Example on the data from the provided link:
Index(['First Name', 'Team'], dtype='object')
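To also get the offending rows back as a DataFrame (the output format the question asks for), a small sketch building on the cols mask above:
bad_cols = cols[cols].index
# True for every row that has leading/trailing whitespace in at least one flagged column
row_mask = df[bad_cols].astype(str).apply(lambda c: c.str.contains(r'^\s+|\s+$')).any(axis=1)
df.loc[row_mask, bad_cols]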
I'm trying to change the string "SLL" under the competitions column to "League", but when I tried this:
messi_dataset.replace("SLL", "League",regex = True)
It only changed the first "SLL" to "League", but then other strings that were "SLL" became "UCL". I have no idea why. I also tried changing regex = True to inplace = True, but no luck.
https://drive.google.com/file/d/1ldq6o70j-FsjX832GbYq24jzeR0IwlEs/view?usp=sharing
https://drive.google.com/file/d/1OeCSutkfdHdroCmTEG9KqnYypso3bwDm/view?usp=sharing
Suppose you have a dataframe as below:
import pandas as pd
import re
df = pd.DataFrame({'Competitions': ['SLL', 'sll','apple', 'banana', 'aabbSLL', 'ccddSLL']})
# write a regex pattern that replaces 'SLL'
# I assumed case-irrelevant
regex_pat = re.compile(r'SLL', flags=re.IGNORECASE)
df['Competitions'].str.replace(regex_pat, 'league', regex=True)
# Input DataFrame
Competitions
0 SLL
1 sll
2 apple
3 banana
4 aabbSLL
5 ccddSLL
Output:
0 league
1 league
2 apple
3 banana
4 aabbleague
5 ccddleague
Name: Competitions, dtype: object
Hope it clarifies.
Based on this answer, test this code:
messi_dataset['competitions'] = messi_dataset['competitions'].replace("SLL", "League")
Also, there are many different ways to do this, like this one that I tested:
messi_dataset.replace({'competitions': 'SLL'}, "League")
For those cases where 'SLL' is part of another word:
messi_dataset.replace({'competitions': 'SLL'}, "League", regex=True)
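Note that replace returns a new DataFrame unless you assign the result back (or pass inplace=True), which is a likely reason the change did not seem to stick; a minimal sketch:
# assign the result back so the replacement actually persists
messi_dataset = messi_dataset.replace({'competitions': 'SLL'}, "League", regex=True)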
I'm trying to read a .txt file and output the count of each letter, which works; however, I'm having trouble exporting that data to .csv in a specific way.
A snippet of the code:
freqs = {}
with open(Book1) as f:
    for line in f:
        for char in line:
            if char in freqs:
                freqs[char] += 1
            else:
                freqs[char] = 1
print(freqs)
And for the exporting to csv, I did the following:
test = {'Book 1 Output':[freqs]}
df = pd.DataFrame(test, columns=['Book 1 Output'])
df.to_csv(r'book_export.csv', sep=',')
Currently when I run it, the export looks like this (Manually done):
However, I want the output to have each individual entry on its own row, so it should look something like this when I open it:
I want it split on the ":" and "," into 3 different columns.
I've tried various other answers on here, but most of them end up giving ValueErrors, so maybe I just don't know how to apply them, like the following one:
df[[',']] = df[','].str.split(expand=True)
Use DataFrame.from_dict with DataFrame.rename_axis to set the index name; then the CSV looks like you need:
#sample data
freqs = {'a':5,'b':2}
df = (pd.DataFrame.from_dict(freqs, orient='index',columns=['Book 1 Output'])
.rename_axis('Letter'))
print (df)
Book 1 Output
Letter
a 5
b 2
df.to_csv(r'book_export.csv', sep=',')
Or an alternative is to use a Series:
s = pd.Series(freqs, name='Book 1 Output').rename_axis('Letter')
print (s)
Letter
a 5
b 2
Name: Book 1 Output, dtype: int64
s.to_csv(r'book_export.csv', sep=',')
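Either way, the exported book_export.csv should then look like this (for the sample freqs above):
Letter,Book 1 Output
a,5
b,2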
EDIT:
If there are multiple frequency dicts, change the DataFrame constructor:
freqs = {'a':5,'b':2}
freqs1 = {'a':9,'b':3}
df = pd.DataFrame({'f1':freqs, 'f2':freqs1}).rename_axis('Letter')
print (df)
f1 f2
Letter
a 5 9
b 2 3
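The export then works exactly the same way as before:
# writes Letter,f1,f2 as the header row, one letter per line
df.to_csv(r'book_export.csv', sep=',')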
Similar to this question, but my CSV has a slightly different format. Here is an example:
id,employee,details,createdAt
1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"
2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"
I think the double quotation marks at the beginning of the JSON column might have caused some errors. Using df = pandas.read_csv('file.csv'), this is the dataframe that I got:
id employee details createdAt Unnamed: 1 Unnamed: 2
1 John {Country":"USA" Salary:5000 Review:null}" 2018-09-01
2 Sarah {Country":"Australia" Salary:6000 Review:"Hardworking"}" 2018-09-05
My desired output:
id employee details createdAt
1 John {"Country":"USA","Salary":5000,"Review":null} 2018-09-01
2 Sarah {"Country":"Australia","Salary":6000,"Review":"Hardworking"} 2018-09-05
I've tried adding quotechar='"' as the parameter and it still doesn't give me the result that I want. Is there a way to tell pandas to ignore the first and the last quotation mark surrounding the json value?
As an alternative approach, you could read the file in manually, parse each row correctly, and use the resulting data to construct the dataframe. This works by splitting the row both forwards and backwards to get the non-problematic columns and then taking the remaining part:
import pandas as pd
data = []
with open("e1.csv") as f_input:
for row in f_input:
row = row.strip()
split = row.split(',', 2)
rsplit = [cell.strip('"') for cell in split[-1].rsplit(',', 1)]
data.append(split[0:2] + rsplit)
df = pd.DataFrame(data[1:], columns=data[0])
print(df)
This would display your data as:
id employee details createdAt
0 1 John {"Country":"USA","Salary":5000,"Review":null} 2018-09-01
1 2 Sarah {"Country":"Australia", "Salary":6000,"Review"... 2018-09-05
I have reproduced your file. With:
df = pd.read_csv('e1.csv', index_col=None)
print (df)
Output
id emp details createdat
0 1 john "{"Country":"USA","Salary":5000,"Review":null}" "2018-09-01"
1 2 sarah "{"Country":"Australia", "Salary":6000,"Review... "2018-09-05"
I think there's a better way by passing a regex to sep=r',"|",|(?<=\d),' and possibly some other combination of parameters. I haven't figured it out totally.
Here is a less than optimal option:
df = pd.read_csv('s083838383.csv', sep='##$%^', engine='python')
header = df.columns[0]
print(df)
Why sep='##$%^'? This is just garbage that lets you read the file in with no real separator character. It could be any random string and is just used as a means to import the data into a df object to work with.
df looks like this:
id,employee,details,createdAt
0 1,John,"{"Country":"USA","Salary":5000,"Review...
1 2,Sarah,"{"Country":"Australia", "Salary":6000...
Then you could use str.extract to apply regex and expand the columns:
result = df[header].str.extract(r'(.+),(.+),("\{.+\}"),(.+)',
                                expand=True).applymap(str.strip)
result.columns = header.strip().split(',')
print(result)
result is:
id employee details createdAt
0 1 John "{"Country":"USA","Salary":5000,"Review":null}" "2018-09-01"
1 2 Sarah "{"Country":"Australia", "Salary":6000,"Review... "2018-09-05"
If you need the starting and ending quotes stripped off of the details string values, you could do:
result['details'] = result['details'].str.strip('"')
If the details object items needs to be a dicts instead of strings, you could do:
from json import loads
result['details'] = result['details'].apply(loads)
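Once parsed, the nested fields can be pulled out directly, for example (a sketch assuming the result frame built above):
# each value in details is now a dict, so plain dict access works
result['Country'] = result['details'].apply(lambda d: d.get('Country'))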
I am fairly new to Pandas and I am working on project where I have a column that looks like the following:
AverageTotalPayments
$7064.38
$7455.75
$6921.90
ETC
I am trying to get the cost factor out of it, where the cost could be anything above 7000. First, this column is an object, so I know that I probably cannot compare it to a number directly. The code that I have looks like the following:
import pandas as pd
health_data = pd.read_csv("inpatientCharges.csv")
state = input("What is your state: ")
issue = input("What is your issue: ")
#This line of code will create a new dataframe based on the two letter state code
state_data = health_data[(health_data.ProviderState == state)]
#With the new data set I search it for the injury the person has.
issue_data=state_data[state_data.DRGDefinition.str.contains(issue.upper())]
#I then make it replace the $ sign with a '' so I have a number. I also believe at this point my code may be starting to break down.
issue_data = issue_data['AverageTotalPayments'].str.replace('$', '')
#Since the previous line took out the $ I convert it from an object to a float
issue_data = issue_data[['AverageTotalPayments']].astype(float)
#I attempt to print out the values.
cost = issue_data[(issue_data.AverageTotalPayments >= 10000)]
print(cost)
When I run this code I simply get nan back. Not exactly what I want. Any help with what is wrong would be great! Thank you in advance.
Try this:
In [83]: df
Out[83]:
AverageTotalPayments
0 $7064.38
1 $7455.75
2 $6921.90
3 aaa
In [84]: df.AverageTotalPayments.str.extract(r'.*?(\d+\.*\d*)', expand=False).astype(float) > 7000
Out[84]:
0 True
1 True
2 False
3 False
Name: AverageTotalPayments, dtype: bool
In [85]: df[df.AverageTotalPayments.str.extract(r'.*?(\d+\.*\d*)', expand=False).astype(float) > 7000]
Out[85]:
AverageTotalPayments
0 $7064.38
1 $7455.75
Consider the pd.Series s
s
0 $7064.38
1 $7455.75
2 $6921.90
Name: AverageTotalPayments, dtype: object
This gets the float values
pd.to_numeric(s.str.replace('$', ''), 'ignore')
0 7064.38
1 7455.75
2 6921.90
Name: AverageTotalPayments, dtype: float64
Filter s
s[pd.to_numeric(s.str.replace('$', ''), 'ignore') > 7000]
0 $7064.38
1 $7455.75
Name: AverageTotalPayments, dtype: object
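Note: in recent pandas versions it is safer to be explicit that the dollar sign is a literal rather than a regex, and to use errors='coerce' since errors='ignore' is deprecated; an equivalent sketch:
s[pd.to_numeric(s.str.replace('$', '', regex=False), errors='coerce') > 7000]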