I'm trying to display all the columns of a csv file. This is the file info.
File
And this is the code I'm using:
import pandas as pd

pd.options.display.max_colwidth = None
pd.options.display.max_columns = None
excel1 = pd.read_csv('CO-Chats1.csv', sep=';')
But when I read it, I get this.
Case Owner Resolved Date/Time Case Origin Case Number Status \
0 Reinaldo Franco 10/16/2021, 3:54 PM Chat 20546561 Resolved
1 Catalina Sanchez 10/16/2021, 5:38 AM Chat 5625033 Resolved
Subject
0 General Support
1 Support for payment
I'm not sure what causes the \ or why the remaining columns are printed below the first ones.
Try using display() (available in Jupyter/IPython) instead of print() to view the output.
excel1 = pd.read_csv('CO-Chats1.csv', sep=';')
display(excel1)
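If you are not in a notebook, here is a minimal sketch of another option. The trailing \ is how pandas marks a frame that it wrapped because the frame is wider than display.width (80 characters by default). Turning off display.expand_frame_repr keeps every column on one line. The small DataFrame below is a stand-in built from the two rows in the question, not the real file.

```python
import pandas as pd

# Stand-in for the CSV from the question (only the two rows shown there).
df = pd.DataFrame({
    "Case Owner": ["Reinaldo Franco", "Catalina Sanchez"],
    "Resolved Date/Time": ["10/16/2021, 3:54 PM", "10/16/2021, 5:38 AM"],
    "Case Origin": ["Chat", "Chat"],
    "Case Number": [20546561, 5625033],
    "Status": ["Resolved", "Resolved"],
    "Subject": ["General Support", "Support for payment"],
})

# Show all columns and do not wrap wide frames across several blocks.
with pd.option_context("display.max_columns", None,
                       "display.expand_frame_repr", False):
    flat = repr(df)  # repr() honours the display options

print(flat)
```

To apply this for the whole session, pd.set_option("display.expand_frame_repr", False) should work as well.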
I have two CSV files with the same column names:
file1 contains everyone who took a test, with every status (passed/missed).
file2 contains only those who missed the test.
I'd like to compare file1.column1 and file2.column1.
If they match, then compare file1.column4 and file2.column4.
If those differ, remove that line from file2.
I can't figure out how to do that.
I looked into pandas, but I didn't manage to get anything working.
What I have is:
file1.csv:
name;DOB;service;test status;test date
Smith;12/12/2012;compta;Missed;01/01/2019
foo;02/11/1989;office;Passed;01/01/2019
bar;03/09/1972;sales;Passed;02/03/2018
Doe;25/03/1958;garage;Missed;02/04/2019
Smith;12/12/2012;compta;Passed;04/05/2019
file2.csv:
name;DOB;service;test status;test date
Smith;12/12/2012;compta;Missed;01/01/2019
Doe;25/03/1958;garage;Missed;02/04/2019
What I want to get is:
file1.csv:
name;DOB;service;test status;test date
Smith;12/12/2012;compta;Missed;01/01/2019
foo;02/11/1989;office;Passed;01/01/2019
bar;03/09/1972;sales;Passed;02/03/2018
Doe;25/03/1958;garage;Missed;02/04/2019
Smith;12/12/2012;compta;Passed;04/05/2019
file2.csv:
name;DOB;service;test status;test date
Doe;25/03/1958;garage;Missed;02/04/2019
First, read both files:
import pandas as pd
df1 = pd.read_csv('file1.csv',delimiter=';')
df2 = pd.read_csv('file2.csv',delimiter=';')
Clean up the data frames, because the data contains stray whitespace:
df1.columns= df1.columns.str.strip()
df2.columns= df2.columns.str.strip()
# Assuming only strings
df1 = df1.apply(lambda column: column.str.strip())
df2 = df2.apply(lambda column: column.str.strip())
The solution below assumes that name uniquely identifies a person.
Merge the files:
new_merged_df = df2.merge(df1[['name', 'test status']], how='left', on=['name'], suffixes=('', 'file1'))
DataFrame Result:
name DOB service test status test date test statusfile1
0 Smith 12/12/2012 compta Missed 01/01/2019 Missed
1 Smith 12/12/2012 compta Missed 01/01/2019 Passed
2 Doe 25/03/1958 garage Missed 02/04/2019 Missed
Filter according to the requirements: remove every row whose name has a differing test status.
# Rows where the status in file2 differs from the status in file1
mismatch = new_merged_df['test status'] != new_merged_df['test statusfile1']
# Names with at least one differing status (an empty list if there are none)
drop_names = list(new_merged_df[mismatch]['name'])
# Remove the rows that we don't want
new_merged_df = new_merged_df[~new_merged_df['name'].isin(drop_names)]
Remove the helper column and save the result:
# Save as a file with the same schema as file2 (to_csv takes sep, not delimiter)
new_merged_df = new_merged_df.drop(columns=['test statusfile1'])
new_merged_df.to_csv('file2.csv', sep=';', index=False)
Result
name DOB service test status test date
2 Doe 25/03/1958 garage Missed 02/04/2019
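For reference, the steps above can be condensed into one self-contained sketch. It reads the sample data from the question through io.StringIO instead of real files, and the helper name drop_mismatched is made up for this illustration. It also assumes that a name from file2 that is missing from file1 should be dropped too, since its status cannot be confirmed.

```python
import io
import pandas as pd

FILE1 = """name;DOB;service;test status;test date
Smith;12/12/2012;compta;Missed;01/01/2019
foo;02/11/1989;office;Passed;01/01/2019
bar;03/09/1972;sales;Passed;02/03/2018
Doe;25/03/1958;garage;Missed;02/04/2019
Smith;12/12/2012;compta;Passed;04/05/2019
"""
FILE2 = """name;DOB;service;test status;test date
Smith;12/12/2012;compta;Missed;01/01/2019
Doe;25/03/1958;garage;Missed;02/04/2019
"""

def drop_mismatched(df1, df2):
    """Return df2 without the names whose status differs anywhere in df1."""
    merged = df2.merge(df1[["name", "test status"]], how="left",
                       on="name", suffixes=("", "_file1"))
    # A missing match yields NaN, which also counts as a mismatch here.
    mismatch = merged["test status"] != merged["test status_file1"]
    bad_names = merged.loc[mismatch, "name"].unique()
    return df2[~df2["name"].isin(bad_names)]

df1 = pd.read_csv(io.StringIO(FILE1), sep=";", dtype=str)
df2 = pd.read_csv(io.StringIO(FILE2), sep=";", dtype=str)
result = drop_mismatched(df1, df2)
print(result)
```

Filtering df2 directly, rather than the merged frame, avoids the duplicated rows that the merge produces for names appearing more than once in file1.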
This is not the same question as double quoted elements in csv cant read with pandas.
The difference is that in that question: "ABC,DEF" was breaking the code.
Here, "ABC "DE" ,F" is breaking the code.
The whole string should be parsed in as 'ABC "DE", F'. Instead the inside double quotes are leading to the below-mentioned issue.
I am working with a csv file that contains the following type of entries:
header1, header2, header3,header4
2001-01-01,123456,"abc def",V4
2001-01-02,789012,"ghi "jklm" n,op",V4
The second row of data is breaking the code, with the following error:
ParserError: Error tokenizing data. C error: Expected 4 fields in line 1234, saw 5
I have tried playing with various sep, delimiter & quoting etc. arguments but nothing seems to work.
Can someone please help with this? Thank you!
Based on the two rows you provided, one option is to read the text file into a Series and then use Series.str.extract() with a regex to pull the fields into a DataFrame:
import pandas as pd

with open('so.txt') as f:
    contents = f.readlines()
s = pd.Series(contents)
s now looks like the following:
0 header1, header2, header3,header4\n
1 \n
2 2001-01-01,123456,"abc def",V4\n
3 \n
4 2001-01-02,789012,"ghi "jklm" n,op",V4
Now you can use regex extract to get what you want into a DataFrame:
# raw string, so that \w is passed to the regex engine unchanged
df = s.str.extract(r'^([0-9]{4}-[0-9]{2}-[0-9]{2}),([0-9]+),(.+),(\w{2})$')
# remove empty rows
df = df.dropna(how='all')
df looks like the following:
0 1 2 3
2 2001-01-01 123456 "abc def" V4
4 2001-01-02 789012 "ghi "jklm" n,op" V4
You can then set the column names with df.columns = ['header1', 'header2', 'header3', 'header4']
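Here is a self-contained sketch of the same idea that also strips the outer quotes from the third field. It uses an in-memory list of lines instead of so.txt, and it assumes every data row follows the date, integer, text, two-character code layout of the two sample rows. Rows with a different shape would simply not match the pattern and would be dropped.

```python
import pandas as pd

# In-memory stand-in for the lines of the file.
lines = [
    'header1, header2, header3,header4\n',
    '2001-01-01,123456,"abc def",V4\n',
    '2001-01-02,789012,"ghi "jklm" n,op",V4\n',
]
s = pd.Series(lines)

# The greedy (.+) absorbs inner quotes and commas; the last comma followed
# by exactly two word characters marks the start of the final field.
pattern = r'^([0-9]{4}-[0-9]{2}-[0-9]{2}),([0-9]+),(.+),(\w{2})$'
df = s.str.extract(pattern).dropna(how='all').reset_index(drop=True)
df.columns = ['header1', 'header2', 'header3', 'header4']

# Remove one leading and one trailing double quote, keeping the inner ones.
df['header3'] = df['header3'].str.replace(r'^"|"$', '', regex=True)
print(df)
```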
Despite multiple attempts, I cannot get a simple merge of two dataframes to work. The code below raises
KeyError: 'CODE'
on the merge function.
Note 1: To make the post reproducible, StringIO is used here with only two lines within each CSV, but in real life I read from files with thousands of records.
Note 2: Notice the trailing ',' (separator) at the end of each line: my CSV files are badly formatted but this is how actual files are.
Note 3: I am using Python 2.7
from StringIO import StringIO
import pandas as pd
master = StringIO("""N-NUMBER,SERIAL NUMBER,MFR MDL CODE,ENG MFR MDL,YEAR MFR,TYPE REGISTRANT,NAME,STREET,STREET2,CITY,STATE,ZIP CODE,REGION,COUNTY,COUNTRY,LAST ACTION DATE,CERT ISSUE DATE,CERTIFICATION,TYPE AIRCRAFT,TYPE ENGINE,STATUS CODE,MODE S CODE,FRACT OWNER,AIR WORTH DATE,OTHER NAMES(1),OTHER NAMES(2),OTHER NAMES(3),OTHER NAMES(4),OTHER NAMES(5),EXPIRATION DATE,UNIQUE ID,KIT MFR, KIT MODEL,MODE S CODE HEX,
1 ,1071 ,3980115,54556,1988,5,FEDERAL AVIATION ADMINISTRATION ,WASHINGTON REAGAN NATIONAL ARPT ,3201 THOMAS AVE HANGAR 6 ,WASHINGTON ,DC,20001 ,1,001,US,20160614,19900214,1T ,5,5 ,V ,50000001, ,19880909, , , , , ,20191130,00524101, , ,A00001 ,""")
mfr = StringIO("""CODE,MFR,MODEL,TYPE-ACFT,TYPE-ENG,AC-CAT,BUILD-CERT-IND,NO-ENG,NO-SEATS,AC-WEIGHT,SPEED,
3980115,EXLINE ACE-C ,ACE-C ,4,1 ,1,1,01,001,CLASS 1,0082,""")
masterdf = pd.read_csv(master,sep=",",index_col=False)
mfrdf = pd.read_csv(mfr,sep=",",index_col=False)
masterdf.merge(mfrdataframe,left_on='MFR MDL CODE',right_on='CODE', how='inner')
I think that the problem is the name of the dataframe you're passing to merge: mfrdataframe should instead be mfrdf.
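As a rough illustration, here is the corrected call in a shortened sketch. It assumes Python 3 (hence io.StringIO rather than the StringIO module) and keeps only three columns per file, with the trailing separators preserved, so it is not the full data from the question.

```python
from io import StringIO
import pandas as pd

# Shortened stand-ins for the two CSVs, trailing ',' kept on every line.
master = StringIO("""N-NUMBER,SERIAL NUMBER,MFR MDL CODE,
1    ,1071   ,3980115,""")
mfr = StringIO("""CODE,MFR,MODEL,
3980115,EXLINE ACE-C ,ACE-C ,""")

masterdf = pd.read_csv(master, sep=",", index_col=False)
mfrdf = pd.read_csv(mfr, sep=",", index_col=False)

# Pass the frame that was actually created (mfrdf, not mfrdataframe).
merged = masterdf.merge(mfrdf, left_on="MFR MDL CODE", right_on="CODE",
                        how="inner")
print(merged)
```

The trailing separator gives each frame an extra empty column, which you may want to drop before merging.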
I have the following data frame:
str_value
0 Mock%20the%20Week
1 law
2 euro%202016
There are many such escape sequences, such as %20, %2520, etc. How do I remove them all? I have tried the following, but the dataframe is large and I am not sure how many different sequences there are.
dfSearch['str_value'] = dfSearch['str_value'].str.replace('%2520', ' ')
dfSearch['str_value'] = dfSearch['str_value'].str.replace('%20', ' ')
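You can use unquote from the urllib library and apply it with the map method of a Series. The session below is for Python 2; in Python 3 the function is urllib.parse.unquote.
Example: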
In [23]: import urllib
In [24]: dfSearch["str_value"].map(lambda x:urllib.unquote(x).decode('utf8'))
Out[24]:
0 Mock the Week
1 law
2 euro 2016
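For Python 3, a minimal sketch could look like the following. The helper fully_unquote is a made-up name. It applies urllib.parse.unquote repeatedly so that double-encoded values such as %2520 end up as a space as well. The fourth row is a hypothetical example added to show that case. Note that any literal % followed by two hex digits in the real text would be decoded too.

```python
from urllib.parse import unquote
import pandas as pd

def fully_unquote(text, max_rounds=5):
    """Percent-decode repeatedly until the text stops changing."""
    for _ in range(max_rounds):
        decoded = unquote(text)
        if decoded == text:
            break
        text = decoded
    return text

dfSearch = pd.DataFrame({
    "str_value": [
        "Mock%20the%20Week",
        "law",
        "euro%202016",
        "euro%25202016",  # hypothetical double-encoded value
    ]
})
dfSearch["str_value"] = dfSearch["str_value"].map(fully_unquote)
print(dfSearch)
```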