error_bad_lines=False doesn't remove rows with extra columns - python

I have a CSV file that I process with pandas. I have four columns, as follows:
df.columns = ["id", "ocr", "raw_value", "manual_raw_value"]
However, I have some rows which have more than four columns. For instance:
id ocr raw_value manual_raw_value
2d704f42 OMNIPAGE remuneration rémunération hello
bfa6c9f14 OMNIPAGE 35470 35470
213e1e1e OMNIPAGE Echeance Echéance
I did the following in order not to read the rows with extra columns (like the first row):
df = pd.read_csv(filename, sep=",",index_col=None, error_bad_lines=False)
However, the rows with extra columns are kept.
Thank you

Another approach: for easier indexing, I would rename the columns, even those which are unnecessary:
df.columns = range(0, df.shape[1])
I assume that the empty places are NaN, so valid rows will have NaN in all of the extra columns. I could not find a specific function for this, so I would iterate over the extra columns one by one, keep only the rows where they are NaN, and then pick only the needed columns:
for i in range(4, df.shape[1]):
    df = df[df.iloc[:, i].isnull()]
df = df[[0, 1, 2, 3]]
Then rename them however you want. Hope this helps.
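Putting the pieces together, a minimal sketch of that idea might look like this (assuming the file has no header row and at most six fields per line; the filename and the upper bound of 6 are placeholders to adjust for your data):

import pandas as pd

# "file.csv" and the column-count bound of 6 are assumptions for illustration.
df = pd.read_csv("file.csv", sep=",", header=None, names=range(6))

# Keep only rows whose extra cells (column 4 onwards) are all NaN,
# then drop the extra columns and restore the intended names.
df = df[df.iloc[:, 4:].isnull().all(axis=1)]
df = df.iloc[:, :4]
df.columns = ["id", "ocr", "raw_value", "manual_raw_value"]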

Related

Take rows of ith duplicated index and put in i number of dataframes

This is a bit tricky to put into words, but I'll give it a try. I have a dataframe with duplicated indices as provided below.
a = [0.00000, 0.071928, 1.294, 2.592563, 0.000318, 2.575291, 0.439986, 2.232147, 6.091523, 2.075441, 0.96152]
b = [0.00000, 0.399791, 1.302446, 1.388957, 1.276451, 1.527568, 1.614107, 2.686325, 4.167600, 6.135689, 5.945807]
df = pd.DataFrame({'a' : a, 'b' : b})
df.index = [1,1,1,1,1,2,2,3,3,3,4]
I want the row of the first duplicated index for every number to be appended to df1, the row of the second duplicated index to be appended to df2, and so on: the first time indices 1, 2, 3, 4, ..., n have a duplicate, those rows get appended to dataframe 1; the second time they have a duplicate, those rows get appended to dataframe 2, and so on. Ideally, the result would look something like the first three duplicates concatenated under an 'index' column.
Any idea how to go about this? I've tried running df[df.duplicated(subset=['index'])] in a for loop to whittle the df down to just the first duplicates, but it doesn't seem to work the way I think it will.
Slicing out the duplicate indices via cumcount and using concat to stitch together the resulting sub-dataframes will do the job.
cols = df.columns
df['id'] = df.index  # expose the index as a column so it can be grouped on
occurrence = df.groupby('id').cumcount()  # 0 for the first appearance, 1 for the first duplicate, ...
pd.concat([df[occurrence == i][cols] for i in range(occurrence.max())], axis=1)
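If separate dataframes (df1, df2, ...) are wanted instead of a side-by-side concat, a variation on the same cumcount idea is sketched below; it groups directly on the index and covers every occurrence number:

import pandas as pd

a = [0.00000, 0.071928, 1.294, 2.592563, 0.000318, 2.575291,
     0.439986, 2.232147, 6.091523, 2.075441, 0.96152]
b = [0.00000, 0.399791, 1.302446, 1.388957, 1.276451, 1.527568,
     1.614107, 2.686325, 4.167600, 6.135689, 5.945807]
df = pd.DataFrame({'a': a, 'b': b}, index=[1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4])

# Occurrence number of each index value: 0 for its first appearance,
# 1 for its first duplicate, 2 for its second duplicate, and so on.
occurrence = df.groupby(level=0).cumcount()

# One sub-dataframe per occurrence number; sub_dfs[1] holds the first
# duplicates, sub_dfs[2] the second duplicates, etc.
sub_dfs = [df[occurrence == i] for i in range(occurrence.max() + 1)]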

Converting every other csv file column from python list to value

I have several large CSV files, each with 100 columns and 800k rows. Starting from the first column, every other column has cells that look like a Python list: for example, cell A2 contains [1000], cell A3 contains [2300], and so forth. Column 2 is fine and contains plain numbers, but columns 1, 3, 5, 7, ..., 99 are like column 1, with their values inside list brackets. Is there an efficient way to remove the list brackets [] from those columns so their cells become plain numbers?
files_directory = r"D:\my_files"
dir_files = os.listdir(r"D:\my_files")
for file in dir_files:
    edited_csv = pd.read_csv("%s\%s" % (files_directory, file))
    for column in list(edited_csv.columns):
        if (column % 2) != 0:
            edited_csv[column] = ?
Please try:
import pandas as pd

df = pd.read_csv('file.csv', header=None)
df.columns = df.iloc[0]  # first row becomes the header
df = df[1:]
for x in df.columns[::2]:
    # strip the surrounding brackets and convert to a number
    df[x] = df[x].apply(lambda v: float(v[1:-1]))
print(df)
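If the goal is the per-file loop from the question, a rough sketch along the same lines might look like this (the directory path and the assumption that the 1st, 3rd, 5th, ... columns are the bracketed ones come from the question; str.strip is used here instead of the slicing lambda):

import os
import pandas as pd

files_directory = r"D:\my_files"   # directory from the question

for name in os.listdir(files_directory):
    path = os.path.join(files_directory, name)
    edited_csv = pd.read_csv(path)
    # Every other column, starting with the first, holds strings like "[1000]".
    for col in edited_csv.columns[::2]:
        edited_csv[col] = edited_csv[col].str.strip('[]').astype(float)
    # Write the cleaned data back out (adjust the output path as needed).
    edited_csv.to_csv(path, index=False)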
If the cells end up as actual lists rather than strings, for example column_1[3] being [4554.8433], you can read the numerical value inside simply by indexing into it:
value = column_1[3]
print(value[0]) #prints 4554.8433 instead of [4554.8433]

Keep other columns untouched

I'm trying to divide some columns by a fixed number (1000) and remove commas, and also convert the mixed types into numbers, with the second-to-last code line. Besides the listed columns, I have other columns that are being deleted after executing the code. How can I keep the other columns?
df_1 = pd.read_excel(os.path.join(directory, 'copy.xlsm'), sheet_name="weekly", header=None)
df_1 = df_1.drop(df_1.columns[[0, 1]], axis=1)
df_1.columns = df_1.loc[3].rename(None)
df_1 = df_1.drop(range(5))
columns = ["A", "B", "D", "G"]
df_1 = df_1.loc[:len(df_1) - 2, columns].replace(',', '', regex=True).apply(pd.to_numeric) / 1000
df_1.to_csv(directory + 'new.csv', index=False, header=True)
Your problem is in this part:
df_1 = df_1.loc[...]...
You're overwriting the original value of df_1 with a subset of your columns (and it seems that you're losing some rows too) when using this selector: [:len(df_1) - 2, columns]. You only need to update the values of that selection:
df_1.loc[...] = df_1.loc[...]...
By using loc as the target of the assignment, you only modify the selected rows and columns, writing the new values where they should go.
Your code should contain this line instead (added for clarity):
df_1.loc[:len(df_1) - 2, columns] = df_1.loc[:len(df_1) - 2, columns].replace(',', '', regex=True).apply(pd.to_numeric) / 1000
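As a hypothetical toy example of the difference (the frame and column names below are made up for illustration):

import pandas as pd

df_1 = pd.DataFrame({"A": ["1,000", "2,000"],
                     "B": ["3,000", "4,000"],
                     "keep_me": ["x", "y"]})   # a column that should survive
columns = ["A", "B"]

# Assigning into .loc only modifies the selected cells,
# so "keep_me" is still present afterwards.
df_1.loc[:, columns] = (df_1.loc[:, columns]
                        .replace(',', '', regex=True)
                        .apply(pd.to_numeric) / 1000)
print(df_1)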

Multiple columns with the same name in Pandas

I am creating a dataframe from a CSV file. I have gone through the docs and multiple SO posts and links, as I have just started with Pandas, but I didn't figure this out. The CSV file has multiple columns with the same name, say a.
So after forming the dataframe, when I do df['a'], which value will it return? It does not return all values.
Also, only one of those columns will have a string; the rest will be None. How can I get that column?
The relevant parameter is mangle_dupe_cols.
From the docs:
mangle_dupe_cols : boolean, default True
    Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'
By default, all of your 'a' columns get named 'a.0'...'a.N' as specified above.
If you used mangle_dupe_cols=False, importing this csv would produce an error.
You can get all of your columns with
df.filter(like='a')
Demonstration:
from io import StringIO
import pandas as pd
txt = """a, a, a, b, c, d
1, 2, 3, 4, 5, 6
7, 8, 9, 10, 11, 12"""
df = pd.read_csv(StringIO(txt), skipinitialspace=True)
df
df.filter(like='a')
I had a similar issue, not due to reading from csv, but I had multiple df columns with the same name (in my case 'id'). I solved it by taking df.columns and resetting the column names using a list.
In : df.columns
Out:
Index(['success', 'created', 'id', 'errors', 'id'], dtype='object')
In : df.columns = ['success', 'created', 'id1', 'errors', 'id2']
In : df.columns
Out:
Index(['success', 'created', 'id1', 'errors', 'id2'], dtype='object')
From here, I was able to call 'id1' or 'id2' to get just the column I wanted.
That's what I usually do with my gene expression dataset, where the same gene name can occur more than once because of a slightly different genetic sequence of the same gene:
Create a list of the duplicated columns in my dataframe (this refers to column names which appear more than once):
duplicated_columns_list = []
list_of_all_columns = list(df.columns)
for column in list_of_all_columns:
    if list_of_all_columns.count(column) > 1 and column not in duplicated_columns_list:
        duplicated_columns_list.append(column)
duplicated_columns_list
Use the .index() function, which finds the first not-yet-renamed occurrence of each duplicated column on every iteration, and append a suffix to it:
for column in duplicated_columns_list:
    list_of_all_columns[list_of_all_columns.index(column)] = column + '_1'
    list_of_all_columns[list_of_all_columns.index(column)] = column + '_2'
This for loop suffixes all of the duplicated columns, so every column now has a distinct name.
This specific code assumes each duplicated name appears exactly twice, but it can be adapted for columns that appear more than twice in your dataframe.
Finally, rename your columns with the underscored elements:
df.columns = list_of_all_columns
That's it, I hope it helps :)
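A rough sketch of that generalization, assuming df is the dataframe with repeated column names: every repeated name gets a running suffix (_1, _2, _3, ...), however many times it appears.

from collections import Counter

name_counts = Counter(df.columns)   # how often each name appears in total
seen = Counter()                    # how often each name has appeared so far
new_columns = []
for column in df.columns:
    if name_counts[column] > 1:
        seen[column] += 1
        new_columns.append(f"{column}_{seen[column]}")
    else:
        new_columns.append(column)
df.columns = new_columns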
Similarly to JDenman6 (and related to your question), I had two df columns with the same name (named 'id').
Hence, calling
df['id']
returns 2 columns.
You can use
df.iloc[:,ind]
where ind corresponds to the position of the column according to how the columns are ordered in the df. You can find the indices using:
indices = [i for i,x in enumerate(df.columns) if x == 'id']
where you replace 'id' with the name of the column you are searching for.
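For instance, with the ['success', 'created', 'id', 'errors', 'id'] columns from the answer above, indices would come out as [2, 4], and the second 'id' column could then be selected with:
second_id = df.iloc[:, indices[1]]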

import csv with different number of columns per row using Pandas

What is the best approach for importing a CSV that has a different number of columns in each row into a Pandas DataFrame, using Pandas or the CSV module?
"H","BBB","D","Ajxxx Dxxxs"
"R","1","QH","DTR"," "," ","spxxt rixxls, raxxxd","1"
Using this code:
import pandas as pd
data = pd.read_csv("smallsample.txt",header = None)
the following error is generated:
Error tokenizing data. C error: Expected 4 fields in line 2, saw 8
Supplying a list of column names to read_csv() should do the trick.
ex: names=['a', 'b', 'c', 'd', 'e']
https://github.com/pydata/pandas/issues/2981
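A quick sketch with the two sample rows from the question (read from a StringIO buffer here; in practice you would pass the filename, e.g. "smallsample.txt", instead). Supplying eight names covers the widest row, and the shorter row is padded with NaN:

import pandas as pd
from io import StringIO

txt = '''"H","BBB","D","Ajxxx Dxxxs"
"R","1","QH","DTR"," "," ","spxxt rixxls, raxxxd","1"'''

data = pd.read_csv(StringIO(txt), header=None,
                   names=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(data)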
Edit: if you don't want to supply column names, then do what Nicholas suggested.
You can dynamically generate column names as simple counters (0, 1, 2, etc.):
import pandas as pd

# Input
data_file = "smallsample.txt"
# Delimiter
data_file_delimiter = ','
# The max column count a line in the file could have
largest_column_count = 0

# Loop the data lines
with open(data_file, 'r') as temp_f:
    # Read the lines
    lines = temp_f.readlines()
    for l in lines:
        # Count the columns in the current line
        column_count = len(l.split(data_file_delimiter))
        # Keep the largest column count seen so far
        largest_column_count = max(largest_column_count, column_count)

# Generate column names (will be 0, 1, 2, ..., largest_column_count - 1)
column_names = [i for i in range(0, largest_column_count)]

# Read csv
df = pd.read_csv(data_file, header=None, delimiter=data_file_delimiter, names=column_names)
# print(df)
Missing values will be assigned to the columns which your CSV lines don't have a value for.
A polished version of the answer above is as follows. It works; just remember that a lot of missing values are inserted into the dataframe.
import pandas as pd

### Loop the data lines
with open("smallsample.txt", 'r') as temp_f:
    # Get the number of columns in each line
    col_count = [len(l.split(",")) for l in temp_f.readlines()]

### Generate column names (names will be 0, 1, 2, ..., maximum columns - 1)
column_names = [i for i in range(0, max(col_count))]

### Read csv
df = pd.read_csv("smallsample.txt", header=None, delimiter=",", names=column_names)
If you want something really concise without explicitly giving column names, you could do this:
Make a one-column DataFrame with each row being a line in the .csv file.
Split each row on commas and expand the DataFrame:
df = pd.read_fwf('<filename>.csv', header=None)
df[0].str.split(',', expand=True)
Error tokenizing data. C error: Expected 4 fields in line 2, saw 8
The error gives a clue for solving the problem: "Expected 4 fields in line 2, saw 8" means the second row has 8 fields while the first row has 4.
import pandas as pd

# Inside range(), set the maximum value you see in "Expected 4 fields in line 2, saw 8"
# - here it is 8
data = pd.read_csv("smallsample.txt", header=None, names=range(8))
Use range instead of manually setting names, as doing it by hand is cumbersome when you have many columns.
You can use shantanu pathak's method above to find the longest row length in your data.
Additionally, you can fill the NaN values with 0 if you need uniform row lengths, e.g. for clustering (k-means):
new_data = data.fillna(0)
You could even use the pd.read_table() method to read the CSV file into a single-column DataFrame, which can then be read and split on ','.
Alternatively, manipulate your CSV so that the first row is the one with the most elements and all following rows have fewer; Pandas will create as many columns as the first row has.
