I am try to delete an icons which appears in many rows of my csv file. When I create a dataframe object using pd.read_csv it shows a green squared check icon, but if I open the csv using Excel I see ✅ instead. I tried to delete using split function because the verification status is separated by | to the comment:
df['reviews'] = df['reviews'].apply(lambda x: x.split('|')[1])
I noticed it didn't detect the "|" separator when the review contains the icon mentioned above.
I am not sure if it is an encoding problem. I tried to add encoding='utf-8' in pandas read_csv but It didn't solve the problem.
Thanks in advance.
I would like to add, this is a pic when I open the csv file using Excel.
You can remove non-latin characters using encode/decode methods:
>>> df
reviews
0 ✓ Trip Verified
1 Verified
>>> df['reviews'].str.encode('latin1', errors='ignore').str.decode('latin1')
0 Trip Verified
1 Verified
Name: reviews, dtype: object
Say you had the following dataframe:
reviews
0 ✅ Trip Verified
1 Not Verified
2 Not Verified
3 ✅ Trip Verified
You can use the replace method to replace the ✅ symbol which is unicode character 2705.
df['reviews'] = df['reviews'].apply(lambda x: x.replace('\u2705',''))
Here is the full example:
Code:
import pandas as pd
df = pd.DataFrame({"reviews":['\u2705 Trip Verified', 'Not Verified', 'Not Verified', '\u2705 Trip Verified']})
df['reviews'] = df['reviews'].apply(lambda x: x.replace('\u2705',''))
print(df)
Output:
reviews
0 Trip Verified
1 Not Verified
2 Not Verified
3 Trip Verified
I am trying to complete a script to store all the trail reports my company gets from various clearing houses. As part of this script I rip the data from multiple excel sheets (over 20 a month) and an amalgamate it in a series of pandas dataframes(organized in a timeline). Unfortunately when I try to output a new spreadsheet with the amalgamated summaries, I get a 'number stored as text' error from excel.
FinalFile = Workbook()
FinalFile.create_sheet(title='Summary') ### This will hold a summary table eventually
for i in Timeline:
index = Timeline.index(i)
sheet = FinalFile.create_sheet(title=i)
sheet[i].number_format = 'Currency'
df = pd.DataFrame(Output[index])
df.columns = df.iloc[0]
df = df.iloc[1:].reset_index(drop=True)
df.head()
df = df.set_index('Payment Type')
for r in dataframe_to_rows(df, index=True,header=True):
sheet.append(r)
for cell in sheet['A'] + sheet[1]:
cell.style='Pandas'
SavePath = SaveFolder+'/'+CurrentDate+'.xlsx'
FinalFile.save(SavePath)
using number_format = 'Currency' to format as currency did not resolve this, nor did my attempt to use the write only methond on the openpyxl documentation page
https://openpyxl.readthedocs.io/en/stable/pandas.html
Fundamentally this code is outputting the right index, headers, sheetname and formatting the only issue issue is the numbers stored as text from B3:D7.
Attached is an example month Output
example dataframe of the same month
0 Total Paid Net GST
Payment Type
Adjustments -2800 -2546 -254
Agency Upfront 23500 21363 2135
Agency Trail 46980 42708 4270
Referring Office Trail 16003 14548 1454
NilTrailPayment 0 0 0
I have a Data Science Related project. I need to remove a certain name from my DataFrame. here is what I attempted:
delete_row_1 = batsal[batsal["playerID"]=='giambja01'].index
remaining_players = batsal.drop(delete_row_1)
To test whether this worked I wrote this and got False:
'giambja01' in remaining_players['playerID']
False
It seems to have worked. and yet when i run the following code i get this:
remaining_players['playerID']
10836 giambja01
13287 heltoto01
2446 berkmla01
11336 gonzalu01
8271 drewjd01
25101 pujolal01
17276 lawtoma02
82 abreubo01
5395 catalfr01
10852 giambje01
22174 nevinph01
20635 mientdo01
6275 coninje01
11545 gracema01
20173 mclemma01
23005 ordonma01
24596 pierrju01
22418 nixontr01
5903 clarkto02
30281 sweenmi01
20688 millake01
18086 loducpa01
11810 grievbe01
3145 boonebr01
29869 stewash01
33183 whitero02
32039 vidrojo01
Name: playerID, dtype: object
I am attaching a sample DataFrame:
batsal = pd.DataFrame({'playerID':['giambja01' , 'damonjo01' , 'saenzol01'],'Sex':['M','M','M']})
Please let me know what I did wrong.
The issue is drop works with columns and not rows. Instead you have to designate the index of the item you would like to remove and the columns data should be removed from. You should try:
df.drop(index='giambja01', columns='1').
Try this, specifying the index:
remaining_players = batsal.drop(index=delete_row_1)
Find the documentation on the function here.
I've been going in circles for days now, and I've run out of steam. Doesn't help that I'm new to python / numpy / pandas etc.
I started with numpy which led me to pandas, because of a GIS function that delivers a numpy array of data. That is my starting point. I'm trying to get to an endpoint being a small enriched dataset, in an excel spreadsheet.
But it seems like going down a rabbit hole trying to extract that data, and then manipulate it with the numpy toolsets. The delivered data is one dimensional, but each row contains 8 fields. A simple conversion to pandas and then to ndarray, magically makes it all good. Except that I lose headers in the process, and it just snowballs from there.
I've had to revaluate my understanding, based on some feedback on another post, and that's fine. But I'm just going in circles. Example after example seems to use predominantly numerical data, and I'm starting to get the feeling that's where it's strength lies. My trying to use it for what I call more of a non-mathematical / numerical purpose looks like I'm barking up the wrong tree.
Any advice?
Addendum
The data I extract from the GIS system is names, dates, other textual data. I then have another csv file that I need to use as a lookup, so that I can enrich the source with more textual information which finally gets published to excel.
SAMPLE DATA - SOURCE
WorkCode Status WorkName StartDate EndDate siteType Supplier
0 AT-W34319 None Second building 2020-05-04 2020-05-31 Type A Acem 1
1 AT-W67713 None Left of the red office tower 2019-02-11 2020-08-28 Type B Quester Q
2 AT-W68713 None 12 main street 2019-05-23 2020-11-03 Class 1 Type B Dettlim Group
3 AT-W70105 None city central 2019-03-07 2021-08-06 Other Hans Int
4 AT-W73855 None top floor 2019-05-06 2020-10-28 Type a None
SAMPLE DATA - CSV
["Id", "Version","Utility/Principal","Principal Contractor Contact"]
XM-N33463,7.1,"A Contracting company", "555-12345"
XM-N33211,2.1,"Contractor #b", "555-12345"
XM-N33225,1.3,"That other contractor", "555-12345"
XM-N58755,1.0,"v Contracting", "555-12345"
XM-N58755,2.3,"dsContracting", "555-12345"
XM-222222,2.3,"dsContracting", "555-12345"
BM-O33343,2.1,"dsContracting", "555-12345"
def SMAN():
####################################################################################################################
# Exporting the results of the analysis...
####################################################################################################################
"""
Approach is as follows:
1) Get the source data
2) Get he CSV lookup data loaded into memory - it'll be faster
3) Iterate through the source data, looking for matches in the CSV data
4) Add an extra couple of columns onto the source data, and populate it with the (matching) lookup data.
5) Export the now enhanced data to excel.
"""
arcpy.env.workspace = workspace + filenameGDB
input = "ApprovedActivityByLocalBoard"
exportFile = arcpy.da.FeatureClassToNumPyArray(input, ['WorkCode', 'Status','WorkName', 'PSN2', 'StartDate', 'EndDate', 'siteType', 'Supplier'])
# we have our data, but it's (9893,) instead of [9893 rows x 8 columns]
pdExportFile = pandas.DataFrame(exportFile)
LBW = pdExportFile.to_numpy()
del exportFile
del pdExportFile
# Now we have [9893 rows x 8 columns] - but we've lost the headers
col_list = ["WorkCode", "Version","Principal","Contact"]
allPermits = pandas.read_csv("lookup.csv", usecols=col_list)
# Now we have the CSV file loaded ... and only the important parts - should be fast.
# Shape: (94523, 4)
# will have to find a way to improve this...
# CSV file has got more than WordCode, because there are different versions (as different records)
# Only want the last one.
# each record must now be "enhanced" with matching record from the CSV file.
finalReport = [] # we are expecting this to be [9893 rows x 12 columns] at the end
counter = -1
for eachWorksite in LBW [:5]: #let's just work with 5 records right now...
counter += 1
# eachWorksite=list(eachWorksite) # eachWorksite is a tuple - so need to convert it
# # but if we change it to a list, we lose the headers!
certID = LBW [counter][0] # get the ID to use for lookup matching
# Search the CSV data
permitsFound = allPermits[allPermits['Id']==certID ]
permitsFound = permitsFound.to_numpy()
if numpy.shape(permitsFound)[0] > 1:
print ("Too many hits!") # got to deal with that CSV Version field.
exit()
else:
# now "enrich" the record/row by adding on the fields from the lookup
# so a row goes from 8 fields to 12 fields
newline = numpy.append (eachWorksite, permitsFound)
# and this enhanced record/row must become the new normal
# but I cannot change the original, so it must go into a new container
finalReport = numpy.append(finalReport, newline, axis = 0)
# now I should have a new container, of "enriched" data
# which as gone from [9893 rows x 8 columns] to [9893 rows x 12 columns]
# Some of the columns of course, could be empty.
#Now let's dump the results to an Excel file and make it accessible for everyone else.
df = pandas.DataFrame (finalReport)
filepath = 'finalreport.csv'
df.to_csv('filepath', index = False)
# Somewhere I was getting Error("Cannot convert {0!r} to Excel".format(value))
# Now I get
filepath = 'finalReport.xlsx'
df.to_excel(filepath, index=False)
I have eventually answered my own question, and this is how:
Yes, for my situation, pandas worked just fine, even beautifully for
manipulating non numerical data. I just had to learn some basics.
The biggest learning was to understand the pandas data frame as an object that has to be manipulated remotely by various functions/tools. Just because I "print" the dataframe, doesn't mean it's just text. (Thanks juanpa.arrivillaga for poitning out my erroneous assumptions) in Why can I not reproduce a nd array manually?
I also had to wrap my mind around the concept of indexes and columns, and how they could be altered/manipulated/ etc. And then, how to use them to maximum effect.
Once those fundamentals had been sorted, the rest followed naturally, and my code reduced to a couple of nice elegant functions.
Cheers
I have two csv files with same columns name:
In file1 I got all the people who made a test and all the status (passed/missed)
In file2 I only have those who missed the test
I'd like to compare file1.column1 and file2.column1
If they match then compare file1.column4 and file2.column4
If they are different remove item line from file2
I can't figure how to do that.
I looked things with pandas but I didn't manage to do anything that works
What I have is:
file1.csv:
name;DOB;service;test status;test date
Smith;12/12/2012;compta;Missed;01/01/2019
foo;02/11/1989;office;Passed;01/01/2019
bar;03/09/1972;sales;Passed;02/03/2018
Doe;25/03/1958;garage;Missed;02/04/2019
Smith;12/12/2012;compta;Passed;04/05/2019
file2.csv:
name;DOB;service;test status;test date
Smith;12/12/2012;compta;Missed;01/01/2019
Doe;25/03/1958;garage;Missed;02/04/2019
What I want to get is:
file1.csv:
name;DOB;service;test status;test date
Smith;12/12/2012;compta;Missed;01/01/2019
foo;02/11/1989;office;Passed;01/01/2019
bar;03/09/1972;sales;Passed;02/03/2018
Doe;25/03/1958;garage;Missed;02/04/2019
Smith;12/12/2012;compta;Passed;04/05/2019
file2.csv:
name;DOB;service;test status;test date
Doe;25/03/1958;garage;Missed;02/04/2019
So first you will have to open:
import pandas as pd
df1 = pd.read_csv('file1.csv',delimiter=';')
df2 = pd.read_csv('file2.csv',delimiter=';')
Treating the data frame, because of white spaces found
df1.columns= df1.columns.str.strip()
df2.columns= df2.columns.str.strip()
# Assuming only strings
df1 = df1.apply(lambda column: column.str.strip())
df2 = df2.apply(lambda column: column.str.strip())
The solution expected, Assuming that your name is UNIQUE.
Merging the files
new_merged_df = df2.merge(df1[['name','test status']],'left',on=['name'],suffixes=('','file1'))
DataFrame Result:
name DOB service test status test date test statusfile1
0 Smith 12/12/2012 compta Missed 01/01/2019 Missed
1 Smith 12/12/2012 compta Missed 01/01/2019 Passed
2 Doe 25/03/1958 garage Missed 02/04/2019 Missed
Filtering based on the requirements and removing the rows with the name with different test status.
filter = new_merged_df['test status'] != new_merged_df['test statusfile1']
# Check if there is different values
if len(new_merged_df[filter]) > 0:
drop_names = list(new_merged_df[filter]['name'])
# Removing the values that we don't want
new_merged_df = new_merged_df[~new_merged_df['name'].isin(drop_names)]
Removing columns and storing
# Saving as a file with the same schema as file2
new_merged_df.drop(columns=['test statusfile1'],inplace=True)
new_merged_df.to_csv('file2.csv',delimiter=';',index=False)
Result
name DOB service test status test date
2 Doe 25/03/1958 garage Missed 02/04/2019