Tab-separated CSV file imported as only one column - python

I'm new to Python and want to import some CSV data with pandas. The data is separated with tabs ('\t').
The CSV file has 35339 rows and 23 columns.
Reading the data with pandas appears to work, but when I try to visualize it, it says the data has only 35339 rows and 1 column. Even though the data looks correctly extracted when printed to the console, only one column (but all rows) is imported.
I tried several different options for importing the data with pandas. I also tried the csv reader but did not get the expected result. Here is a snapshot of the data.
import pandas as pd
import glob

for filename in glob.glob('*.csv'):
    print(filename)
    sensor_df = pd.read_csv(filename, sep='\t', low_memory=False)
    print(sensor_df)
The output is
[35338 rows x 1 columns]
expected is
[35338 rows x 23 columns]

Try skiprows; it looks like the first row is a comment about the separator.
sensor_df = pd.read_csv(filename, sep='\t', low_memory=False, skiprows=1)
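A minimal sketch of why this helps, using an in-memory stand-in for the sensor file (the comment line, column names, and values here are hypothetical; the real file has 23 columns):

```python
import io
import pandas as pd

# Hypothetical stand-in for the sensor file: a comment line about the
# separator, then the tab-separated header and data rows.
raw = "separator: tab\ncol_a\tcol_b\tcol_c\n1\t2\t3\n4\t5\t6\n"

# Peek at the first line to confirm it is not data.
print(raw.splitlines()[0])  # separator: tab

# Skipping that line lets pandas parse the real header and all columns.
df = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=1)
print(df.shape)  # (2, 3)
```

Printing the first line of the file before parsing is a quick way to confirm whether a comment row is throwing off the column detection.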

Related

Pandas to_csv writing extra " on column containing raw json

I'm attempting to save my dataframe as csv after processing my data.
One caveat is I have a column containing the 'raw json' of the file as well.
When pandas saves the file using to_csv(header=False), I get the following
1,2,"{""col_1"":""1"",""col_2"":""1""}"
My dataframe looks like this:
col_1  col_2  raw_json
1      1      {"col_1":1,"col_2":1}
I've tried adding the json col something like:
for i, row in df.iterrows():
    i_val = row.to_json()
    df.at[i, 'raw_json'] = i_val
Expected csv:
1,2,{"col_1":"1","col_2":"1"}
You could use something like this:
import csv
import pandas as pd
df.to_csv('output.csv', index=False, header=False, quoting=csv.QUOTE_NONE, sep=';')
As @pranav-hosangadi explained:
"CSV format uses quotes to escape fields that themselves contain the
separator"
So when you set quoting=csv.QUOTE_NONE, you disable that behavior and nothing is quoted.
Important:
Note that the separator of the csv will be ";" in this case, so you'll need to make sure your fields don't contain ";" characters, which would break your csv.
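A small sketch of the difference, using a one-row frame that mirrors the question's layout (the values are hypothetical):

```python
import csv
import pandas as pd

# One-row frame mirroring the question's layout.
df = pd.DataFrame([[1, 2, '{"col_1":"1","col_2":"1"}']],
                  columns=["col_1", "col_2", "raw_json"])

# Default quoting escapes the JSON because it contains the separator (,)
# and the quote character.
default_out = df.to_csv(index=False, header=False)
print(default_out.strip())  # 1,2,"{""col_1"":""1"",""col_2"":""1""}"

# QUOTE_NONE with a separator that never appears in the JSON leaves
# the field untouched.
raw_out = df.to_csv(index=False, header=False,
                    quoting=csv.QUOTE_NONE, sep=";")
print(raw_out.strip())  # 1;2;{"col_1":"1","col_2":"1"}
```

Note that the second output is no longer standard CSV: a reader must be told about the ";" separator and the disabled quoting to parse it back correctly.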

Pandas (pd.concat) puts each dataframe into a different column

I'm attempting the following workflow:
Import a large .csv file and isolate one column.
Remove duplicates.
Import a second, smaller .csv file and isolate one column.
Concatenate the two into a single column in a dataframe and remove the duplicates so I can check the diff.
Here's the code I have so far.
import pandas as pd
df1 = pd.read_csv('file.csv', header=0, usecols=[4]).drop_duplicates()
# Place a checkpoint here to verify the file wrote correctly.
df1.to_csv('orders_new.csv', index=False, header=None, encoding='utf-8')
df2 = pd.read_csv('file2.csv', header=0, usecols=[0])
# Place a checkpoint here to verify the file wrote correctly.
df2.to_csv('tables_new.csv', index=False, header=None, encoding='utf-8')
new_dataframe = pd.concat([df1, df2], ignore_index=True).drop_duplicates(keep=False)
new_dataframe.to_csv('output.csv')
When I look in the output.csv file, file1 has been properly imported, dupes removed, and written to the new file, but at the point where that file ends and the next begins, the file2 data ends up in the column to the right of it.
Looking for some guidance here.
Thanks!
I tried doing a simple concat within pandas, and it worked, but the output was in two columns, not one.
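pd.concat aligns frames by column name, so two frames whose single columns are named differently end up side by side with NaN filling the gaps. Renaming one frame's column to match the other before concatenating avoids this. A sketch with hypothetical column names and values:

```python
import pandas as pd

# Stand-ins for the two isolated columns; the names differ,
# which is what makes concat produce two columns.
df1 = pd.DataFrame({"order_id": ["a", "b"]})
df2 = pd.DataFrame({"id": ["b", "c"]})

# Give both frames the same column name, then drop every value that
# appears more than once to get the diff.
df2.columns = df1.columns
diff = pd.concat([df1, df2], ignore_index=True).drop_duplicates(keep=False)
print(diff)
```

With matching names the result is a single column, and drop_duplicates(keep=False) leaves only the values unique to one of the two files.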

How can I skip the first line in CSV files imported into a pandas df but keep the header for one of the files?

I essentially want to preserve the header of one of the csv files so its header row becomes the column names, but for the rest of the files I want to skip the header. Is there an easier solution than the following:
import all files with no headers, then change the column names after all csv files are imported, and delete the duplicate header rows from the df.
My current code is:
import glob
import os

import pandas as pd

path = r"C:\Users\..."
my_files = glob.glob(os.path.join(path, "filename*.xlsx"))
file_li = []
for filename in my_files:
    df = pd.read_excel(filename, index_col=None, header=None)
    file_li.append(df)
I am trying to append 365 files into one based on the condition that the file name meets the above criteria. The files looks like this:
   Colunn1  Colunn2  Colunn3  Colunn4  Colunn5  Colunn6  Colunn7  Colunn8  Colunn9  Colunn10  Colunn11
2  DATA     DATA     DATA     DATA     DATA     DATA     DATA     DATA     DATA     DATA      DATA
3
4
5
6
7
I want to keep the column names (Colunn1, Colunn2, ...) from the first file but skip them for the rest, so I don't have to reindex or change the df afterwards. The reason is I don't want duplicate header rows in the df, or missing headers...is this complicating an easier solution?
Why are you putting them in a list?
pandas.concat lets you combine DataFrames while doing the column-name management for you.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
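In other words: if every file is read with its header row (header=0, the default), the header never enters the data, and concat lines the frames up by column name with no duplicate header rows to clean up. A sketch with small in-memory frames standing in for two of the parsed daily files:

```python
import pandas as pd

# Hypothetical stand-ins for two daily files that share a header row;
# in practice these would come from pd.read_excel(filename) in the loop.
frames = [
    pd.DataFrame({"Colunn1": [1, 2], "Colunn2": ["a", "b"]}),
    pd.DataFrame({"Colunn1": [3], "Colunn2": ["c"]}),
]

# concat aligns on the shared column names; ignore_index renumbers rows.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

The list itself is fine as an intermediate step; the point is that concat, not manual renaming, should do the column bookkeeping.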

Why is row number added as a column in csv file saved by pandas?

I have two lists of strings named input_rems_text and input_text. I save them as a csv file.
import pandas as pd
df = pd.DataFrame()
df['A']=input_rems_text
df['B']=input_text
df.to_csv('MyLists.csv', sep="\t")
The output of df.shape is (10000, 2).
The problem is when I read the csv file with this code:
import csv

with open('MyLists.csv', 'r') as file:
    for line_num, row in enumerate(csv.reader(file, delimiter='\t')):
        print(len(row))
I get 3 as the row length, and when I print the row itself, the row number is also present as a separate column at the beginning of the row. What is my mistake? How can I dump a csv file for two lists with just 2 columns?
Set index parameter to False on to_csv function.
df.to_csv('MyLists.csv', sep="\t", index=False)
Documentation
"Row numbers in CSV file" is called "row index". To suppress row index when you save CSV with df.to_csv, specify index=False.
Btw pandas has its own builtin pd.read_csv command for reading, so use it, no need to use base Python csv.reader as you're doing:
df2 = pd.read_csv('MyLists.csv', sep='\t')
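A quick round-trip sketch (with hypothetical values, and an in-memory buffer standing in for MyLists.csv) showing that index=False leaves exactly the two original columns:

```python
import io
import pandas as pd

df = pd.DataFrame({"A": ["x", "y"], "B": ["u", "v"]})

# With index=False the row index is not written, so reading the file
# back yields exactly the two original columns.
buf = io.StringIO()
df.to_csv(buf, sep="\t", index=False)
buf.seek(0)
df2 = pd.read_csv(buf, sep="\t")
print(df2.shape)  # (2, 2)
```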

How to read an Excel file with nested columns with pandas?

I am trying to read an Excel file using pandas but my columns and index are changed:
df = pd.read_excel('Assignment.xlsx',sheet_name='Assignment',index_col=0)
Excel file:
Jupyter notebook:
By default pandas considers the first row the header. You need to tell it to take 2 rows as the header.
df = pd.read_excel("xyz.xlsx", header=[0, 1], usecols="A:I", skiprows=[0])
print(df)
You can include skiprows or not depending on the requirement; if you remove it, the first row will be shown as a header without any unnamed entries.
Refer to this link.
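For context, header=[0, 1] turns the first two rows into a two-level column MultiIndex. A small frame with hand-built MultiIndex columns (the names and values are hypothetical) shows what the result looks like and how to select from it:

```python
import pandas as pd

# What read_excel(header=[0, 1]) produces: two header rows become
# a two-level MultiIndex on the columns.
cols = pd.MultiIndex.from_tuples([("Sales", "2021"), ("Sales", "2022")])
df = pd.DataFrame([[10, 20], [30, 40]], columns=cols)

# Select the whole top-level group, or one nested column via a tuple.
print(df["Sales"])
print(df[("Sales", "2022")].tolist())  # [20, 40]
```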
