When loading a header with missing values, pandas' read_csv creates names like Unnamed: 0_level_1. How can I replace these with empty strings?
import pandas as pd
file = """A,B,C,C
,,C1,C2
1,2,3,4
5,6,7,8
"""
with open('test.csv', 'w') as f:
    f.write(file)
df = pd.read_csv('test.csv', header=[0, 1])
print(df.columns)
You can use the built-in rename method, something like:
data.rename(columns={0: 'whatever you want'}, inplace=True)
More info https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html
When reading the CSV you can also supply your own column names:
uruser_cols = ['A', 'B', 'C', 'D', 'E']
users = pd.read_csv('C:/2/mlotek.csv', header=None, names=uruser_cols)
print(users.head())
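If the goal is specifically to blank out the auto-generated labels from the question above, one sketch (assuming a MultiIndex header read with header=[0, 1], and using inline data instead of the file) rebuilds the columns directly:

```python
import pandas as pd
from io import StringIO

# inline copy of the example CSV from the question
csv_text = "A,B,C,C\n,,C1,C2\n1,2,3,4\n5,6,7,8\n"
df = pd.read_csv(StringIO(csv_text), header=[0, 1])

# replace every auto-generated "Unnamed: ..." label with an empty string
df.columns = pd.MultiIndex.from_tuples(
    [tuple('' if str(name).startswith('Unnamed:') else name for name in col)
     for col in df.columns]
)
print(list(df.columns))
```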
I have code that is breaking because I am trying to reorganize columns while also ignoring other columns in the output CSV file.
Input csv file:
book1.csv
A,B,C,D,E,F
a1,b1,c1,d1,e1,F1
a1,b1,c1,d1,e1,F1
a1,b1,c1,d1,e1,
a1,b1,c1,d1,e1,F1
a1,b1,c1,d1,e1,
My code:
import csv
order_of_headers_should_be = ['A', 'C', 'D', 'E', 'B']
dictionary = {'A':'X1','B':'Y1','C':'U1','D':'T1','E':'K1'}
new_headers = [dictionary[old_header] for old_header in order_of_headers_should_be]
with open('Book1.csv', 'r') as infile, open('reordered.csv', 'a') as outfile:
    # output dict needs a list for new column ordering
    writer = csv.DictWriter(outfile, fieldnames=new_headers)
    # reorder the header first
    writer.writeheader()
    for row in csv.DictReader(infile):
        new_row = {dictionary[old_header]: row[old_header] for old_header in row}
        writer.writerow(new_row)
My current output is only the headers (but they are in the correct order):
X1,U1,T1,K1,Y1
and then I'm getting a KeyError: 'F'.
But I need it to also output so it will look like this:
reordered.csv
X1,U1,T1,K1,Y1
a1,c1,d1,e1,b1
a1,c1,d1,e1,b1
a1,c1,d1,e1,b1
a1,c1,d1,e1,b1
a1,c1,d1,e1,b1
When old_header is 'F' you'll get a KeyError, so the for loop stops and you won't get any data rows in the output file.
Add a check for this to the dictionary comprehension:
new_row = {dictionary[old_header]: value for old_header, value in row.items() if old_header in dictionary}
You could also loop through dictionary instead of row (note the .items() call, needed to unpack the key/value pairs):
new_row = {new_header: row[old_header] for old_header, new_header in dictionary.items()}
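Putting the fix together, a self-contained sketch (StringIO objects stand in for the input and output files, which is an assumption for demonstration):

```python
import csv
import io

dictionary = {'A': 'X1', 'B': 'Y1', 'C': 'U1', 'D': 'T1', 'E': 'K1'}
order_of_headers_should_be = ['A', 'C', 'D', 'E', 'B']
new_headers = [dictionary[h] for h in order_of_headers_should_be]

infile = io.StringIO("A,B,C,D,E,F\na1,b1,c1,d1,e1,F1\na1,b1,c1,d1,e1,\n")
outfile = io.StringIO()

writer = csv.DictWriter(outfile, fieldnames=new_headers)
writer.writeheader()
for row in csv.DictReader(infile):
    # skip keys (like 'F') that have no mapping, avoiding the KeyError
    writer.writerow({dictionary[k]: v for k, v in row.items() if k in dictionary})
print(outfile.getvalue())
```

DictWriter orders each row by fieldnames, so the columns come out reordered automatically.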
Here is a simpler way using pandas to do the heavy lifting.
import pandas as pd
# Read CSV file into DataFrame df
df = pd.read_csv('Book1.csv')
# delete F column
df = df.drop('F', axis=1)
# rename columns
df.columns = ['X1', 'Y1', 'U1', 'T1', 'K1']
# write to file in desired order
df.to_csv('book_out.csv', index=False,
columns=['X1', 'U1', 'T1', 'K1', 'Y1'])
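A variant of the same pandas approach uses rename with the mapping dictionary, so the renaming does not depend on column positions (the inline data here is an assumption for demonstration):

```python
import io
import pandas as pd

csv_text = "A,B,C,D,E,F\na1,b1,c1,d1,e1,F1\na1,b1,c1,d1,e1,\n"
df = pd.read_csv(io.StringIO(csv_text))

# rename by mapping, then select the desired columns in the desired order;
# unmapped columns such as F are simply left out of the selection
mapping = {'A': 'X1', 'B': 'Y1', 'C': 'U1', 'D': 'T1', 'E': 'K1'}
out = df.rename(columns=mapping)[['X1', 'U1', 'T1', 'K1', 'Y1']]
print(out.to_csv(index=False))
```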
How can I read a CSV of that type and keep the original column names? Maybe add some generic column names to the end of the header, depending on the maximum number of columns in the body of the CSV...
a,b,c
1,2,3
1,2,3,
1,2,3,4
Simple read_csv does not work:
tempfile = pd.read_csv(path,
                       index_col=None,
                       sep=',',
                       header=0,
                       error_bad_lines=False,
                       encoding='unicode_escape',
                       warn_bad_lines=True)
b'Skipping line 3: expected 3 fields, saw 4\nSkipping line 4: expected 3 fields, saw 4\n'
I need that type of result:
a,b,c,x1
1,2,3,NA
1,2,3,NA
1,2,3,4
One approach would be to first read just the header row in and then pass these column names with your extra generic names as a parameter to pandas. For example:
import pandas as pd
import csv
filename = "input.csv"
with open(filename, newline="") as f_input:
    header = next(csv.reader(f_input))
header += [f'x{n}' for n in range(1, 10)]
tempfile = pd.read_csv(filename,
index_col=None,
sep=',',
skiprows=1,
names=header,
error_bad_lines=False,
encoding='unicode_escape',
warn_bad_lines=True,
)
skiprows=1 tells pandas to jump over the header row, and names holds the full list of column headers to use. (Note that in pandas 2.0+ the error_bad_lines/warn_bad_lines parameters were removed in favor of on_bad_lines.)
The header would then contain:
['a', 'b', 'c', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9']
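If you would rather size the padding to the actual widest row instead of a fixed range(1, 10), one sketch scans the data once with the csv module first (inline data stands in for the file here):

```python
import csv
import io

csv_text = "a,b,c\n1,2,3\n1,2,3,\n1,2,3,4\n"
rows = list(csv.reader(io.StringIO(csv_text)))
header = rows[0]
max_cols = max(len(r) for r in rows)

# pad the header with generic names up to the widest row
header += [f'x{n}' for n in range(1, max_cols - len(header) + 1)]
print(header)
```

The padded header can then be passed as names=header together with skiprows=1 as above.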
I'm trying to save my string in CSV format.
The string looks like this, lines separated by '\n':
12,12,11,13,11,12
21,15,21,23,41,26
34,16,46,17,21,15
44,17,22,39,10,13
and so on. Also I have a manually written list of headers like
['A', 'B', 'C', 'D', 'E', 'F']
When I tried to write this using the csv writer,
import csv

with open('output.csv', 'w', newline='') as csvwriter:
    writer = csv.writer(csvwriter, dialect='excel')
    writer.writerow(header)  # write header
    for r in output.splitlines():
        writer.writerows(r)
But when I look up the output file,
A,B,C,D,E,F
1
2
","
1
2
","
1
1
","
... (and so on)
Why this happens and how can I fix this? I appreciate any help.
If your string is like this:
string = '''12,12,11,13,11,12
21,15,21,23,41,26
34,16,46,17,21,15
44,17,22,39,10,13'''
headers = ['A', 'B', 'C', 'D', 'E', 'F']
Without any library (use splitlines() so you iterate over lines, not characters, and add the newline back when writing):
with open('untitled.txt', 'w') as f:
    f.write(','.join(headers) + '\n')
    for line in string.splitlines():
        f.write(line + '\n')
You can make it into a pandas csv and then save:
import pandas as pd
data = []
for i in string.split('\n'):
    data.append(i.split(','))
df = pd.DataFrame(data=data, columns=headers)
df.to_csv('path_to_save', index=False)
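As to why the original attempt produced one character per row: writer.writerows(r) treats the string r as a sequence of rows, so every character becomes its own one-cell row. Using writerow with each line split on commas fixes it; a sketch (a StringIO stands in for the output file):

```python
import csv
import io

header = ['A', 'B', 'C', 'D', 'E', 'F']
output = "12,12,11,13,11,12\n21,15,21,23,41,26"

buf = io.StringIO()
writer = csv.writer(buf, dialect='excel')
writer.writerow(header)  # one header row
for r in output.splitlines():
    writer.writerow(r.split(','))  # writerow (singular): one row per line
print(buf.getvalue())
```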
My Python script currently uses csv.DictReader/DictWriter to read a CSV file, re-arrange the columns and write to a new output CSV file. The input CSV file has the following columns:
A;B;C;D
which will be transferred to:
B;C;A;D
Additionally, I would like to rename one of the header. I already tried 2 approaches:
1.) Create a new writer object and use its writerow method. However, this simply puts all the given fieldnames in the very first column:
newHeader = csv.writer(outfile)
newFN = ['B', 'C', 'Renamed', 'D']
newHeader.writerow(newFN)
the output is:
B,C,Renamed,D;;;
2.) Using the existing DictWriter object I define a new list of column headers and iterate over it:
newHeader = ['B', 'C', 'Renamed', 'D']
writer.writerow(dict((fn, fn) for fn in newHeader))
This time however, the renamed column header remains empty in the output CSV.
You can use a dictionary to rename columns and csv.writer to write values from the reordered OrderedDict objects:
from io import StringIO
from collections import OrderedDict
import csv
mystr = StringIO("""A;B;C;D
1;2;3;4
5;6;7;8""")
order = ['B', 'C', 'A', 'D']
# define renamed columns via dictionary
renamer = {'C': 'C2'}
# define column names after renaming
new_cols = [renamer.get(x, x) for x in order]
# replace mystr as open(r'file.csv', 'r')
with mystr as fin, open(r'C:\temp\out.csv', 'w', newline='') as fout:
    # define reader / writer objects
    reader = csv.DictReader(fin, delimiter=';')
    writer = csv.writer(fout, delimiter=';')
    # write new header
    writer.writerow(new_cols)
    # iterate reader and write rows
    for item in reader:
        writer.writerow([item[k] for k in order])
Result:
B;C2;A;D
2;3;1;4
6;7;5;8
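The same result can be had while staying closer to the original DictWriter-based script; the renamed header is written by hand since writeheader would emit the raw fieldnames (StringIO objects stand in for the files, which is an assumption for demonstration):

```python
import csv
import io

fin = io.StringIO("A;B;C;D\n1;2;3;4\n5;6;7;8\n")
fout = io.StringIO()

order = ['B', 'C', 'A', 'D']
renamer = {'C': 'C2'}

reader = csv.DictReader(fin, delimiter=';')
writer = csv.DictWriter(fout, fieldnames=order, delimiter=';')
# write the renamed header manually; fieldnames then handles the reordering
writer.writerow({k: renamer.get(k, k) for k in order})
for row in reader:
    writer.writerow(row)
print(fout.getvalue())
```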
So I have a large dataset where the strings are surrounded by " symbols, but in some cases they have been replaced with ” symbols. The issue is that this makes pandas think my separators are part of the elements, joining two elements together.
I'm hoping to find a way to replace the ” symbol with " without fixing up the CSV file itself, since the code will be used by someone who will most likely have a version of the dataset with those special characters and won't be fixing it up.
The dataset is quite large (21 columns, over 4000 rows) but here is a small example with the issue:
"a";"b";"c";"d";"e";"f"
30;"yes";4.4;"Monday";"no";"yes"
39;"no";3.4;"Tuesday";"no";"no"
47;"no";2.1;"Tuesday”;”no";"yes”
25;"yes";4.5;"wednesday";"no";"yes"
Below is the code I've been trying to use, but I can't seem to get it to work:
import pandas as pd
from StringIO import StringIO
dataset = 'datafile.csv'
sio = StringIO(dataset)
v = sio.getvalue()
v = v.replace('”',"",)
sio.write(dataset)
df = pd.read_csv(dataset, sep=';', decimal='.', skiprows = 1, header= None, names=['a', 'b', 'c', 'd', 'e', 'f'])
str.replace returns the updated string; you should wrap that in a StringIO and then pass it to pandas. Your code appears to be Python 2, but Python 3 is much better at Unicode issues. Here is a Python 3 solution that works on the example data set:
import pandas as pd
from io import StringIO
dataset = 'datafile.csv'
sio = StringIO(open(dataset).read().replace('”','"'))
df = pd.read_csv(sio, sep=';', decimal='.', skiprows = 1, header= None, names=['a', 'b', 'c', 'd', 'e', 'f'])
print(df)
Here is a solution that works for Python 2.7. I am assuming that the CSV file is UTF-8 encoded and that you are using a UTF-8 enabled editor to write the Python script. That's normal on Unix-like systems but may be problematic on Windows.
# coding=utf-8
import pandas as pd
from StringIO import StringIO
import codecs
dataset = 'datafile.csv'
sio = StringIO(codecs.open(dataset, encoding="utf-8").read().replace(u'”',u'"'))
df = pd.read_csv(sio, sep=';', decimal='.', skiprows = 1, header= None, names=['a', 'b', 'c', 'd', 'e', 'f'])
print(df)
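To see the replacement in isolation with inline data (the sample row is an assumption for demonstration):

```python
import pandas as pd
from io import StringIO

raw = '"a";"b";"c"\n47;"Tuesday”;”no"\n'
# normalize the curly quote so the two quoted fields are no longer merged
fixed = raw.replace('”', '"')
df = pd.read_csv(StringIO(fixed), sep=';')
print(df)
```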