Pandas ignoring cells with " and , - python

I have a semicolon-delimited CSV loaded into a pandas DataFrame with all dtypes of object. Within some of the cells the string value can contain a double quote ("), a comma (,), or both (e.g. TES"T_ING,_VALUE). I then query the DataFrame with df.query on some condition to get a subset, but the rows that contain the pattern shown in the example are omitted completely, while the remaining rows are returned just fine. Another requirement is that I need to pair every " in the text with a closing quote, but applying a lambda to replace " with "" is also not working. I have tried several methods and they are listed below.
Problem 1:
pd.read_csv("file.csv", delimiter=';')
pd.read_csv("file.csv", delmiter=';', thousands=',')
pd.read_csv("file.csv", delimiter=";", escapechar='"')
pd.read_csv("file.csv", delimiter=";", encoding='utf-8')
All of the above fail to load the data in question.
Problem 2:
Input: TES"T_ING,_VALUE to TES""T_ING,_VALUE
I have tried:
df.apply(lambda s: s.str.replace('"', '""'))
which doesn't do anything.
What exactly is going on? I haven't been able to find any questions tackling this particular type of issue anywhere.
Appreciate your help in advance.
EDIT: Sorry, I didn't provide mockup data due to sensitivity, but here is some fake data that illustrates the issue.
The following is a sample of the CSV structure:
Column1;Column2;Column3;Column4;Column5\n
TES"T_ING,_VALUE;Col2Value;Col3Value;Col4Value;Col5Value\n
Col1value;TES"T_ING,_VALUE2;Col3Value;Col4Value;Col5Value\n
I have tried using quoting=csv.QUOTE_ALL / csv.QUOTE_NONNUMERIC and quotechar='"' when loading the df, but the result ends up being:
Column1;Column2;Column3;Column4;Column5\n
"TES"T_ING,_VALUE;Col2Value;Col3Value;Col4Value;Col5Value";;;;\n
"Col1value;TES"T_ING,_VALUE2;Col3Value;Col4Value;Col5Value";;;;\n
So it interprets the whole row as a value in column 1 rather than splitting on the ; and applying the quoting to column 1 only. Truthfully, I could iterate through each row in the df, split it, and load the remaining values into their respective columns, but the CSV is quite large so this operation would take some time. The subset of the data the user queries on is supposed to be returned from an endpoint (this part is already working).

The problem was solved by using df.apply with a custom function to process each record.
df = pd.read_csv("csv_file.csv", delimiter=';', escapechar='\\')
def mapper(record):
if ';' in record['col1']:
content = record['col1'].split(';')
if len(content) == num_columns:
if '"' in content[0]:
content[0] = content[0].replace('"', '""')
record['col1'] = content[0]
# repeat for remaining columns
processed = df.apply(lambda x: mapper(x), axis=1)
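For reference, a different approach that may avoid the per-row mapper entirely is to disable quote handling at parse time. This is only a sketch under the assumption that the file never contains legitimately quoted fields; quoting=csv.QUOTE_NONE (the same idea as the answer to the last question on this page) makes the parser treat every " as literal text, so the ; split happens cleanly and the quote-doubling becomes a simple column-wise replace:

import csv
import pandas as pd

# Treat every '"' as literal text so ';' is the only thing that splits fields.
df = pd.read_csv("csv_file.csv", delimiter=';', quoting=csv.QUOTE_NONE)

# Double the embedded quotes; the question states all columns are dtype object,
# but guard anyway in case some column was inferred as numeric.
df = df.apply(lambda s: s.str.replace('"', '""', regex=False) if s.dtype == object else s)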

Related

Pandas skipping certain columns

I'm trying to format an Amazon Vendor CSV using Pandas but I'm running into an issue. The issue stems from the fact that Amazon inserts a row with report information before the headers.
When trying to skip over that row when assigning headers to the dataframe, not all columns are captured. Below is my attempt at explicitly stating which row to pull columns from but it doesn't appear to be correct.
df = pd.read_csv(path + 'Amazon Search Terms_Search Terms_US.csv', sep=',', error_bad_lines=False, index_col=False, encoding='utf-8')
headers = df.loc[0]
new_df = pd.DataFrame(df.values[1:], columns=headers)
print('Copying data into new data frame....')
Before, the data looks like this (I want row 2 to supply all the columns in the new df): [screenshot not shown]
After, it looks like this (only 5 columns are selected): [screenshot not shown]
I've also tried using skiprows when opening the CSV; it doesn't treat the report row as data, so it just ends up skipping actual data. Not really sure what is going wrong here; any help would be appreciated.
As posted in the comment by suvayu, adding header=1 to the read_csv call did the job.
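A minimal sketch of that fix, reusing the path variable and file name from the question; header=1 tells read_csv that the real column names are on the second physical row, so the report-information row above it is skipped automatically:

import pandas as pd

# header=1: use the second row of the file as the header row.
df = pd.read_csv(path + 'Amazon Search Terms_Search Terms_US.csv',
                 sep=',', header=1, index_col=False, encoding='utf-8')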

How to efficiently remove junk above headers in an .xls file

I have a number of .xls datasheets which I am looking to clean and merge.
Each data sheet is generated by a larger system which cannot be changed.
The method that generates the data sets displays the selected parameters for the data set above the data (E.G 1); I am looking to automate the removal of these rows.
The number of rows that this takes up varies, so I am unable to blanket remove x rows from each sheet. Furthermore, the system that generates the report arbitrarily merges cells in the blank sections to the right of the information.
Currently I am attempting what feels like a very inelegant solution where I convert the file to a CSV, read it as a string and remove everything before the first column.
data_xls = pd.read_excel('InputFile.xls', index_col=None)
data_xls.to_csv('Change1.csv', encoding='utf-8')
with open("Change1.csv") as f:
s = f.read() + '\n'
a=(s[s.index("Col1"):])
df = pd.DataFrame([x.split(',') for x in a.split('\n')])
This works but it seems wildly inefficient:
Multiple format conversions
Reading every line in the file when the only rows being altered occur within first ~20
Dataframe ends up with column headers shifted over by one and must be re-aligned (Less concern)
With some of the files being around 20mb, merging a batch of 8 can take close to 10 minutes.
A little hacky, but here is an idea to speed up your process by doing some operations directly on your dataframe. Since you know your first column name is Col1, you could try something like this:
df = pd.read_excel('InputFile.xls', index_col=None)
# Find the first occurrence of "Col1"
column_row = df.index[df.iloc[:, 0] == "Col1"][0]
# Use this row as header
df.columns = df.iloc[column_row]
# Remove the column name (currently a useless index number)
df.columns.name = None
# Keep only the data after the (old) column row
df = df.iloc[column_row + 1:]
# And tidy it up by resetting the index
df.reset_index(drop=True, inplace=True)
This should work for any dynamic number of header rows in your Excel (xls & xlsx) files, as long as you know the title of the first column...
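Since the original goal was merging a batch of these reports, here is a hedged sketch that wraps the steps above in a helper and concatenates the results; the clean_report name and the *.xls glob pattern are assumptions, not part of the original answer:

import glob
import pandas as pd

def clean_report(path):
    # Same steps as above: find the "Col1" header row, promote it, drop the junk above it.
    df = pd.read_excel(path, index_col=None)
    column_row = df.index[df.iloc[:, 0] == "Col1"][0]
    df.columns = df.iloc[column_row]
    df.columns.name = None
    return df.iloc[column_row + 1:].reset_index(drop=True)

# Merge all reports in the working directory into one frame.
merged = pd.concat((clean_report(p) for p in glob.glob('*.xls')), ignore_index=True)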
If you know the number of junk rows, you can skip them using skiprows:
data_xls = pd.read_excel('InputFile.xls', index_col=None, skiprows=2)

Remove rows containing blank space in python data frame

I imported a CSV file into Python (using a pandas data frame) and there are some missing values in it. In the data frame I have rows like the following:
> 08,63.40,86.21,63.12,72.78,,
I have tried everything to remove the rows containing elements like the trailing blank elements in the data above. Nothing works. I do not know if the above is categorized as whitespace, an empty string, or something else.
Here is what I have:
result = pandas.read_csv(file,sep='delimiter')
result[result!=',,']
This did not work. Then I did the following:
result.replace(' ', np.nan, inplace=True)
result.dropna(inplace=True)
This also did not work.
result = result.replace(r'\s+', np.nan, regex=True)
This also did not work. I still see the row containing the ,, element.
Also, my dataframe is 100 by 1. When I import it from the CSV file all the columns collapse into one. (I do not know if this helps.)
Can anyone tell me how to remove rows containing ,, elements?
> Also my dataframe is 100 by 1. When I import it from CSV file all the columns become 1
This is probably the key, and IMHO it is weird. When you import a CSV into a pandas DataFrame you normally want each field to go in its own column, precisely so you can later process each column's values individually. So (still IMHO) the correct solution is to fix that.
Now, to directly answer your (probably XY) question: you do not want to remove rows containing blank or empty columns, because your row only contains one single column, but rather rows containing consecutive commas (,,). So you should use:
df.drop(df[df.iloc[:, 0].str.contains(',,')].index)
I think your code should work with a minor change:
result.replace('', np.nan, inplace=True)
result.dropna(inplace=True)
In case you have several rows in your CSV file, you can avoid the extra conversion step to NaN:
result = pandas.read_csv(file)
result = result[result.notnull().all(axis = 1)]
This will remove any row where there is an empty element.
However, your added comment explains that there is just one row in the CSV file, and it seems that the CSV reader shows some special behavior. Since you need to select the columns without NaN, I suggest these lines:
result = pandas.read_csv(file, header = None)
selected_columns = result.columns[result.notnull().any()]
result = result[selected_columns]
Note the option header = None with read_csv.
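A self-contained sketch of this approach using the sample row from the question; the inline string stands in for the real file:

import io
import pandas as pd

# The trailing ',,' fields parse as NaN, so rows containing them can be filtered out.
sample = "08,63.40,86.21,63.12,72.78,,\n"
result = pd.read_csv(io.StringIO(sample), header=None)
result = result[result.notnull().all(axis=1)]   # same effect as result.dropna()
# result is now empty, because the only row had blank trailing fields.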

reading a multi-indexed CSV in pandas with multiple delimiters

I'm trying to create a very human-readable script that will be multi-indexed. It looks like this:
A
    one : some data
    two : some other data
B
    one : foo
    three : bar
I'd like to use pandas' read_csv to automatically read this in as a multi-indexed file, with both \t and : used as delimiters, so that I can easily slice by section (i.e., A and B). I understand that something like header=[0,1] and perhaps tupleize_cols may be used to this end, but I can't get that far since it doesn't seem to read both the tabs and the colons properly. If I use sep='[\t:]', it consumes the leading tabs. If I don't use the regexp and read with sep='\t', it gets the tabs right but doesn't handle the colons. Is this possible using read_csv? I could do it line by line, but there must be an easier way :)
This is the output I had in mind. I added labels to the indices and column, which could hopefully be applied when reading it in:
                   value
index_1 index_2
A       one        some data
        two        some other data
B       one        foo
        three      bar
EDIT: I used part of Ben.T's answer to get what I needed. I'm not in love with my solution since I'm writing to a temp file, but it does work:
with open('temp.csv', 'w') as outfile:
    for line in open(reader.filename, 'r'):
        if line[0] != '\t' or not line.strip():
            index1 = line.split('\n')[0]
        else:
            outfile.write(index1 + ':' + re.sub('[\t]+', '', line))

pd.read_csv('temp.csv', sep=':', header=None,
            names=['index_1', 'index_2', 'Value']).set_index(['index_1', 'index_2'])
You can use two delimiters in read_csv, such as:
pd.read_csv(path_file, sep=':|\t', engine='python')
Note the engine='python' to prevent a warning.
EDIT: with your input format it seems difficult, but with input like:
A one : some data
A two : some other data
B one : foo
B three : bar
with a \t as the delimiter after A or B, you get a MultiIndex with:
pd.read_csv(path_file, sep=':|\t', header = None, engine='python', names = ['index_1', 'index_2', 'Value'] ).set_index(['index_1', 'index_2'])
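A short usage sketch of the result (the df variable name and the whitespace note are mine, not part of the original answer): once the MultiIndex is set, a whole section can be sliced directly.

import pandas as pd

df = pd.read_csv(path_file, sep=':|\t', header=None, engine='python',
                 names=['index_1', 'index_2', 'Value']).set_index(['index_1', 'index_2'])

# Everything under section A; note that values split on ' : ' may keep stray
# spaces, so strip the index labels first if exact lookups are needed.
section_a = df.loc['A']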

pandas.read_csv not partitioning data at semicolon delimiter

I'm having a tough time correctly loading a CSV file into a pandas dataframe. The file is a CSV saved in MS Excel, where the rows look like this:
Montservis, s.r.o.;"2 012";"-14.98";"-34.68";"- 11.7";"0.02";"0.09";"0.16";"284.88";"10.32";"
I am using
filep="file_name.csv"
raw_data = pd.read_csv(filep,engine="python",index_col=False, header=None, delimiter=";")
(I have tried several combinations and alternatives of read_csv arguments, but without any success. I have also tried read_table.)
What I want to see in my dataframe is each semicolon-separated value in its own column (I understand that read_csv works this way(?)).
Unfortunately, I always end up with the whole row placed in the first column of the dataframe. So basically after loading I have many rows, but only one column (two if I also count the index).
I have placed a sample here: datafile
Any ideas are welcome.
Add quoting=3. 3 stands for QUOTE_NONE; see the csv module documentation.
raw_data = pd.read_csv(filep,engine="python",index_col=False, header=None, delimiter=";", quoting = 3)
This will give a [7 rows x 23 columns] dataframe.
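The same call spelled with the named constant instead of the literal 3 (filep is the variable defined in the question):

import csv
import pandas as pd

# csv.QUOTE_NONE == 3: the parser ignores quote characters entirely.
raw_data = pd.read_csv(filep, engine="python", index_col=False, header=None,
                       delimiter=";", quoting=csv.QUOTE_NONE)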
The problem is the enclosing characters, which can be ignored by using the \ character:
raw_data = pd.read_csv(filep,engine="python",index_col=False, header=None, delimiter='\;')
