Problem when processing from CSV to CSV with a row count - python

I am trying to process a CSV file into a new CSV file containing only the columns of interest, and to remove rows with unfit values of -1. Unfortunately I get unexpected results: the script automatically includes column 0 (the old ID) in the new CSV file without my explicitly asking for it (it is not listed in cols = [..]).
How can I renumber these values as a new row count? For example, when row 9 with id=9 is removed, the dataset IDs currently run [..7, 8, 10...] instead of being renumbered as [..7, 8, 9, 10...]. I hope someone has a solution for this.
import pandas as pd
# take only specific columns from dataset
cols = [1, 5, 6]
data = pd.read_csv('data_sample.csv', usecols=cols, header=None)
data.columns = ["url", "gender", "age"]
# remove rows from dataset with undefined values of -1
data = data[data['gender'] != -1]
data = data[data['age'] != -1]
""" Additional working solution
indexGender = data[data['gender'] == -1].index
indexAge = data[data['age'] == -1].index
# Delete the rows indexes from dataFrame
data.drop(indexGender,inplace=True)
data.drop(indexAge, inplace=True)
"""
data.to_csv('data_test.csv')
Thank you in advance.

I solved the problem with a simple line after the data drop:
data.reset_index(drop=True, inplace=True)
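For reference, here is a sketch of the whole pipeline with that fix in place (same file names as above). Passing index=False to to_csv would drop pandas' index column entirely, while leaving it out writes the freshly renumbered index as the new ID:
import pandas as pd

cols = [1, 5, 6]
data = pd.read_csv('data_sample.csv', usecols=cols, header=None)
data.columns = ["url", "gender", "age"]
# drop rows with undefined values of -1
data = data[(data['gender'] != -1) & (data['age'] != -1)]
# renumber the remaining rows 0, 1, 2, ...
data.reset_index(drop=True, inplace=True)
data.to_csv('data_test.csv')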

Related

Converting every other csv file column from python list to value

I have several large csv files, each with 100 columns and 800k rows. Starting from the first column, every other column has cells that look like a Python list; for example, cell A2 contains [1000], cell A3 contains [2300], and so forth. Column 2 is fine and contains plain numbers, but columns 1, 3, 5, 7, ..., 99 are similar to column 1: their values are wrapped in list brackets. Is there an efficient way to remove the list brackets [] from those columns and make their cells plain numbers?
files_directory: r":D\my_files"
dir_files =os.listdir(r"D:\my_files")
for file in dir_files:
edited_csv = pd.read_csv("%s\%s"%(files_directory, file))
for column in list(edited_csv.columns):
if (column % 2) != 0:
edited_csv[column] = ?
Please try:
import pandas as pd
df = pd.read_csv('file.csv', header=None)
df.columns = df.iloc[0]
df = df[1:]
for x in df.columns[::2]:
    df[x] = df[x].apply(lambda s: float(s[1:-1]))  # strip the surrounding brackets
print(df)
If the cells have been read in as Python lists (for example column_1[3], which in this case is [4554.8433]), you can get the numerical value inside simply by indexing:
value = column_1[3]
print(value[0])  # prints 4554.8433 instead of [4554.8433]
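If the cells come back from read_csv as plain strings such as "[4554.8433]" rather than lists (which is what pandas normally gives you), a small sketch using ast.literal_eval could do the conversion instead; the file name and the every-other-column pattern simply mirror the answer above, and the file is assumed to have no header row:
import ast
import pandas as pd

df = pd.read_csv('file.csv', header=None)
# parse strings like "[4554.8433]" into lists, then take the single element
for col in df.columns[::2]:
    df[col] = df[col].apply(lambda s: ast.literal_eval(s)[0])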

Extracting specific columns from pandas.dataframe

I'm trying to use Python to read my csv file, extract specific columns into a pandas.DataFrame, and show that DataFrame. However, I don't see the data frame; I receive Series([], dtype: object) as output. Below is the code that I'm working with:
My document consists of:
product sub_product issue sub_issue consumer_complaint_narrative
company_public_response company state zipcode tags
consumer_consent_provided submitted_via date_sent_to_company
company_response_to_consumer timely_response consumer_disputed?
complaint_id
I want to extract :
sub_product issue sub_issue consumer_complaint_narrative
import pandas as pd
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df = df.stack(level=0)
df2 = df.filter(regex='[B-F]')
df[df2]
import pandas as pd
input_file = "C:\\....\\consumer_complaints.csv"
dataset = pd.read_csv(input_file)
df = pd.DataFrame(dataset)
cols = [1,2,3,4]
df = df[df.columns[cols]]
Here, specify the column numbers you want to select. In a DataFrame, columns start at index 0.
cols = []
You can also select columns by name. Just use the following line:
df = df[["Column Name","Column Name2"]]
A simple way to achieve this would be as follows:
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df2 = df.loc[:,'B':'F']
Hope that helps.
This worked for me, using slicing:
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df1 = df[n1:n2]
where n1 < n2 are both columns in the range, e.g.:
if you want columns 3-5, use
df1 = df[3:5]
For the first column, use
df1 = df[0]
Though not sure how to select a discontinuous range of columns.
We can also use iloc. Given data in dataset2:
dataset2.iloc[:3,[1,2]]
This will spit out the top 3 rows of columns 2-3 (remember, numbering starts at 0).

Pulling two cols from csv

I have a csv file with 330k+ rows and 12 columns. I need to put column 1 (numeric ID) and column 3 (text string) into a list or array so I can analyze the data in column 3.
This code worked for me to pull out the third column:
for row in csv_strings:
    string1.append(row[2])
Can someone point me to the correct class of commands that I can research to get the job done?
Thanks.
Pandas is the best tool for this.
import pandas as pd
df = pd.read_csv("filename.csv", usecols=[ 0, 2 ])
points = []
for row in csv_strings:
    points.append({"id": row[0], "text": row[2]})
You can pull them out into a list of key-value pairs.
A different answer, using tuples, which ensure immutability and are pretty fast, but less convenient than dictionaries:
# build results
results = []
for row in csv_lines:
    results.append((row[0], row[2]))
# Read results
for result in results:
    result[0]  # id
    result[1]  # string
import csv
x, z = [], []
with open('Data.csv') as f:
    csv_reader = csv.reader(f)
    for line in csv_reader:
        x.append(line[0])
        z.append(line[2])
This will get you the data from the 1st and 3rd columns.

Editing a specific cell with Pandas [Python]

I am having a problem with Pandas; I have looked everywhere but think I am overlooking something.
I have a csv file I import into pandas, which has an ID column and another column I will call Column 2. I want to:
1. Input an ID to python.
2. Search this ID in the ID column with Pandas, and put a 1 on the adjacent cell, in Column 2.
import pandas
csvfile = pandas.read_csv('document1.csv')
#Convert everything to string for simplicity
csvfile['ID'] = csvfile['ID'].astype(str)
#Fill in all missing NaN
csvfile = csvfile.fillna('missing')
#looking for the row in which the ID '10099870.0' is in.
indexid = csvfile.loc[csvfile['ID'] == '10099870.0'].index
# Important part! I think this selects the column 2, row 'indexid' and replaces missing with 1.
csvfile['Column 2'][indexid].replace('missing', '1')
I know this is a simple question but thanks for all your help!
Mauricio
This is what I'd do:
cond = csvfile.ID == '10099870.0'
col = 'Column 2'
csvfile.loc[cond, col] = csvfile.loc[cond, col].replace('missing', '1')
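If the aim is simply to put a 1 in Column 2 next to that ID, regardless of what the cell currently holds, a direct assignment is a possible shortcut (a sketch reusing the question's names; writing back to the file is optional):
csvfile.loc[csvfile['ID'] == '10099870.0', 'Column 2'] = '1'
csvfile.to_csv('document1.csv', index=False)  # persist the change if needed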

import csv with different number of columns per row using Pandas

What is the best approach for importing a CSV that has a different number of columns for each row, using Pandas or the csv module, into a Pandas DataFrame?
"H","BBB","D","Ajxxx Dxxxs"
"R","1","QH","DTR"," "," ","spxxt rixxls, raxxxd","1"
Using this code:
import pandas as pd
data = pd.read_csv("smallsample.txt",header = None)
the following error is generated
Error tokenizing data. C error: Expected 4 fields in line 2, saw 8
Supplying a list of column names to read_csv() should do the trick.
ex: names=['a', 'b', 'c', 'd', 'e']
https://github.com/pydata/pandas/issues/2981
Edit: if you don't want to supply column names then do what Nicholas suggested
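For the sample above, where the widest line has 8 fields, a minimal sketch of that suggestion could look like this (the names themselves are arbitrary):
import pandas as pd

names = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']  # 8 names, one per possible field
data = pd.read_csv("smallsample.txt", header=None, names=names)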
You can dynamically generate column names as simple counters (0, 1, 2, etc.):
import pandas as pd

# Input
data_file = "smallsample.txt"
# Delimiter
data_file_delimiter = ','

# The max column count a line in the file could have
largest_column_count = 0

# Loop the data lines
with open(data_file, 'r') as temp_f:
    # Read the lines
    lines = temp_f.readlines()
    for l in lines:
        # Count the columns in the current line
        column_count = len(l.split(data_file_delimiter))
        # Keep track of the largest column count seen so far
        largest_column_count = max(largest_column_count, column_count)

# Generate column names (will be 0, 1, 2, ..., largest_column_count - 1)
column_names = [i for i in range(0, largest_column_count)]

# Read csv
df = pd.read_csv(data_file, header=None, delimiter=data_file_delimiter, names=column_names)
# print(df)
Missing values will be assigned to the columns which your CSV lines don't have a value for.
A polished version of P.S.'s answer is as follows. It works.
Remember that we have inserted a lot of missing values into the dataframe.
import pandas as pd

### Loop the data lines
with open("smallsample.txt", 'r') as temp_f:
    # get the number of columns in each line
    col_count = [len(l.split(",")) for l in temp_f.readlines()]

### Generate column names (names will be 0, 1, 2, ..., maximum columns - 1)
column_names = [i for i in range(0, max(col_count))]

### Read csv
df = pd.read_csv("smallsample.txt", header=None, delimiter=",", names=column_names)
If you want something really concise without explicitly giving column names, you could do this:
1. Make a one-column DataFrame with each row being a line in the .csv file
2. Split each row on commas and expand the DataFrame
df = pd.read_fwf('<filename>.csv', header=None)
df = df[0].str.split(',', expand=True)
Error tokenizing data. C error: Expected 4 fields in line 2, saw 8
The error gives a clue to solving the problem: "Expected 4 fields in line 2, saw 8" means that the second row has 8 fields while the first row has 4.
import pandas as pd
# inside range(), set the maximum field count you see in "Expected 4 fields in line 2, saw 8"
# here that is 8
data = pd.read_csv("smallsample.txt", header=None, names=range(8))
Use range instead of manually setting names, as doing that by hand becomes cumbersome when you have many columns. You can use Shantanu Pathak's method above to find the longest row length in your data.
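Putting the two suggestions together, a rough sketch that scans the file for the widest row and then passes a range() to names:
import pandas as pd

with open("smallsample.txt") as f:
    max_cols = max(len(line.split(",")) for line in f)
data = pd.read_csv("smallsample.txt", header=None, names=range(max_cols))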
Additionally, you can fill the NaN values with 0 if you need rows of equal length, e.g. for clustering (k-means):
new_data = data.fillna(0)
We could even use the pd.read_table() method to read the csv file; it loads the file into a single-column DataFrame, which can then be split on ','.
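A rough sketch of that idea, assuming the file contains no tabs so that read_table's default tab delimiter leaves each comma-separated line in a single column:
import pandas as pd

df = pd.read_table("smallsample.txt", header=None)  # one column holding each full line
df = df[0].str.split(',', expand=True)              # split the lines out into columns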
Manipulate your csv so that the first row is the one with the most elements and all following rows have fewer. Pandas will create as many columns as the first row has.
