I'm writing a script to reduce a large .xlsx file with headers down to a CSV, and then write a new CSV file containing only the required columns, selected by their header names.
import pandas
import csv
df = pandas.read_csv('C:\\Python27\\Work\\spoofing.csv')
time = df["InviteTime (Oracle)"]
orignum = df["Orig Number"]
origip = df["Orig IP Address"]
destnum = df["Dest Number"]
df.to_csv('output.csv', header=[time,orignum,origip,destnum])
The error I'm getting is with that last bit of code, and it says
ValueError: Writing 102 cols but got 4 aliases
I'm sure I'm overlooking something stupid, but I've read over the to_csv documentation on the pandas website and I'm still at a loss. I know I'm misusing the to_csv parameters but I can't seem to get my head around the documentation.
Any help is appreciated, thanks!
The way to select specific columns is this -
header = ["InviteTime (Oracle)", "Orig Number", "Orig IP Address", "Dest Number"]
df.to_csv('output.csv', columns = header)
column_list = ["column_name1", "column_name2", "column_name3", "column_name4"]
#filter the dataframe beforehand
ds[column_list].to_csv('output.csv', index=False)
#or use the columns arg
ds.to_csv('output.csv', columns=column_list, index=False)
I pass index=False so that only the column values are written, without the row index.
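To make both approaches concrete, here is a small self-contained sketch (the tiny DataFrame and its extra column are made up for illustration):

```python
import pandas as pd

# A tiny stand-in for the full ~100-column spreadsheet
df = pd.DataFrame({
    "InviteTime (Oracle)": ["10:00", "10:05"],
    "Orig Number": ["555-0100", "555-0101"],
    "Orig IP Address": ["10.0.0.1", "10.0.0.2"],
    "Dest Number": ["555-0200", "555-0201"],
    "Unwanted Column": ["x", "y"],
})

header = ["InviteTime (Oracle)", "Orig Number", "Orig IP Address", "Dest Number"]

# Option 1: filter the DataFrame first, then write it
df[header].to_csv("output_filtered.csv", index=False)

# Option 2: let to_csv select the columns itself
df.to_csv("output_columns.csv", columns=header, index=False)
```

Both files come out identical: just the four named columns, with no index column.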
Related
I'm trying to read 100 CSVs and collate data from all into a single CSV.
I made use of :
all_files = pd.DataFrame()
for file in files :
all_files = all_files.append(pd.read_csv(file,encoding= 'unicode_escape')).reset_index(drop=True)
where files = list of filepaths of 100 CSVs
Now, each CSV may have a different number of columns, and within a single CSV each row may have a different number of columns too.
I want to match the column headers names, put the data from all the CSVs in the correct column, and keep on adding new columns to my final DF on the go.
The above code works fine for 30-40 CSVs and then breaks and gives the following error:
ParserError: Error tokenizing data. C error: Expected 16 fields in line 78, saw 17
Any help will be much appreciated!
There are a couple of ways to read variable-length CSV files.
First, you can specify the column names beforehand. If you are not sure of the number of columns, you can give a reasonably large number:
df = pd.read_csv('filename.csv', header=None, names=list(range(10)))
The other option is to read the entire file into a single column, using a delimiter that doesn't appear in the data (tab here), and then split on commas yourself:
df = pd.read_csv('filename.csv', header=None, sep='\t')
df = df[0].str.split(',', expand=True)
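As a runnable sketch of that second option, with an in-memory ragged "file" (io.StringIO stands in for the real path):

```python
import io
import pandas as pd

# Three rows with a varying number of comma-separated fields
raw = "a,b,c\n1,2,3,4\nx,y\n"

# Read each line as a single column (tab never appears, so nothing splits)...
df = pd.read_csv(io.StringIO(raw), header=None, sep='\t')

# ...then split on commas ourselves; short rows are padded with None
df = df[0].str.split(',', expand=True)
print(df)
```

The result is a 3x4 frame: the widest row sets the column count, and shorter rows get None in the missing positions.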
It's because you are trying to read all the CSV files into a single DataFrame. When the first file is read, the number of columns for the DataFrame is decided, and an error results when a different number of columns is fed in. If you really want to concat them, you should read them all individually, adjust their columns, and then concat them.
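A minimal sketch of that read-then-concat approach, with io.StringIO buffers standing in for the real file paths (the column names are made up):

```python
import io
import pandas as pd

# Two "files" with overlapping but different columns
buffers = [io.StringIO("a,b\n1,2\n"), io.StringIO("a,c\n3,4\n")]

# Read each file into its own DataFrame first
frames = [pd.read_csv(buf) for buf in buffers]

# concat aligns on column names and fills the gaps with NaN
all_files = pd.concat(frames, ignore_index=True, sort=False)
print(all_files)
```

Note this handles files whose *column sets* differ; ragged rows inside one file still need one of the read_csv tricks from the previous answer.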
I'm trying to format an Amazon Vendor CSV using Pandas but I'm running into an issue. The issue stems from the fact that Amazon inserts a row with report information before the headers.
When trying to skip over that row when assigning headers to the dataframe, not all columns are captured. Below is my attempt at explicitly stating which row to pull columns from but it doesn't appear to be correct.
df = pd.read_csv(path + 'Amazon Search Terms_Search Terms_US.csv', sep=',', error_bad_lines=False, index_col=False, encoding='utf-8')
headers = df.loc[0]
new_df = pd.DataFrame(df.values[1:], columns=headers)
print('Copying data into new data frame....')
Before, it looks like this (I want row 2 to be all the columns in the new df):
After the fact it looks like this (it only selects 5):
I've also tried using skiprows when opening the CSV; it doesn't treat the report row as data, so it just ends up skipping actual data. Not really sure what is going wrong here; any help would be appreciated.
As posted in the comment by @suvayu, adding header=1 to the read_csv call did the job.
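For reference, a hedged sketch of that fix, with the report row simulated in-memory since the real Amazon file isn't available (the column names below are invented):

```python
import io
import pandas as pd

# First line mimics Amazon's report-information row; the real
# column headers are on the second line (row index 1)
raw = "Report info: Search Terms US\nKeyword,Impressions,Clicks\nshoes,100,7\n"

# header=1 tells pandas to take row 1 as the header and ignore
# everything above it
df = pd.read_csv(io.StringIO(raw), header=1)
print(df.columns.tolist())
```

Unlike treating the report row as data, this way every real column is captured from the header line.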
So I am writing in Python using pandas. The code I wrote extracts specific column headers from an Excel file, which works, but I don't want to have to go into the code every time to change the names of the column headers to extract when working on new files with the same data.
Here is my extraction method:
xlsx = pd.ExcelFile('filepath')
df = pd.read_excel(xlsx, 'Tabelle1')
df2 = df[['ZONE_NAME','POLYGONSTRING']]
df2.to_csv('filepath\name', sep=';', index=False, header=True)
So When I run this code to another excel file I want it to accept any possible name for "ZONE_NAME" which could be "zonename", "Zone Name" etc...
If your problem is just limited to different ways to write a column like "ZONE_NAME" (e.g., "zone_name", "ZONENAME", "ZONE_name", etc.) then why not just use some type of filter on the column names:
import re
import pandas as pd

xlsx = pd.ExcelFile('filepath')
df = pd.read_excel(xlsx, 'Tabelle1')
# This filters any non-alphabetical characters out of each column
# name and lowercases it (so "ZONE_NAME", "ZONENAME" and "zone_NAME"
# all become "zonename")
filtered_columns = [re.sub('[^a-zA-Z]', '', c).lower() for c in df.columns]
df.columns = filtered_columns
df2 = df[filtered_columns]
df2.to_csv(r'filepath\name', sep=';', index=False, header=True)
Hope this helps.
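A runnable version of that idea on a toy DataFrame (standing in for the Excel sheet, using the column names from the question):

```python
import re
import pandas as pd

# Simulate a sheet whose headers vary in spelling
df = pd.DataFrame({"Zone Name": ["A"], "POLYGONSTRING": ["B"]})

# Normalize: strip non-alphabetic characters and lowercase, so
# "ZONE_NAME", "Zone Name", "zonename" etc. all end up as "zonename"
df.columns = [re.sub('[^a-zA-Z]', '', c).lower() for c in df.columns]

# Selection can now use the normalized names regardless of spelling
print(df[["zonename"]])
```

One caveat worth keeping in mind: two originally distinct headers can collapse to the same normalized name, which would produce duplicate columns.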
pandas read_csv will automatically detect column headers; no need to specify anything ahead of time. Your post is missing the links to the images; it would be better to post samples of the data inline instead.
Here's my problem, I have an Excel sheet with 2 columns (see below)
I'd like to print (on the Python console or in an Excel cell) all the data in this form:
"1" : ["1123","1165", "1143", "1091", "n"], *** n ∈ [A2; A205]***
We don't really care about column B, but I need every postal code in this specific form.
Is there a way to do it with Excel, or in Python with pandas? (If you have any other ideas I would love to hear them.)
Cheers
I think you can use parse_cols to parse only the first column, and then filter out rows 205 to 1000 with skiprows in read_excel:
df = pd.read_excel('test.xls',
                   sheet_name='Sheet1',
                   parse_cols=0,
                   skiprows=list(range(205, 1000)))
print (df)
Last, use tolist to convert the first column to a list:
print({"1": df.iloc[:,0].tolist()})
The simplest solution is to parse only the first column and then use iloc:
df = pd.read_excel('test.xls',
parse_cols=0)
print({"1": df.iloc[:206,0].astype(str).tolist()})
I am not familiar with Excel, but pandas can easily handle this problem.
First, read the excel to a DataFrame
import pandas as pd
df = pd.read_excel(filename)
Then, print as you like
print({"1": list(df.iloc[0:N]['A'])})
where N is the number of rows you would like to print. That is it. If the values are not strings, you need to cast the ints to strings.
Also, there are a lot of parameters that control the loading part of read_excel; you can go through the documentation to set suitable ones.
Hope this is helpful to you.
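Putting the answers above together on a small in-memory DataFrame (standing in for the Excel sheet, with column 'A' holding the postal codes from the question):

```python
import pandas as pd

df = pd.DataFrame({"A": [1123, 1165, 1143, 1091],
                   "B": ["x", "y", "z", "w"]})

N = 4  # how many postal codes to take

# Cast each int to str so the output matches the requested form
result = {"1": [str(v) for v in df.iloc[0:N]["A"]]}
print(result)
```

This prints the requested form: {'1': ['1123', '1165', '1143', '1091']}.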
I read an Excel Sheet into a pandas DataFrame this way:
import pandas as pd
xl = pd.ExcelFile("Path + filename")
df = xl.parse("Sheet1")
The first cell's value in each column is selected as the column name for the DataFrame. I want to specify my own column names. How do I do this?
This thread is 5 years old and outdated now, but still shows up at the top of the results for a generic search. So I am adding this note. pandas now (v0.22) has a keyword to specify column names when parsing Excel files. Use:
import pandas as pd
xl = pd.ExcelFile("Path + filename")
df = xl.parse("Sheet 1", header=None, names=['A', 'B', 'C'])
If header=None is not set, pd seems to consider the first row as the header and drops it during parsing. If there is indeed a header but you don't want to use it, you have two choices: either (1) use the names kwarg only; or (2) use names with header=None and skiprows=1. I personally prefer the second option, since it clearly notes that the input file is not in the format I want, and that I am working around it.
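Option (2) can be sketched with read_csv on an in-memory buffer, as a stand-in for the Excel file since no workbook is at hand; the header, names, and skiprows keywords behave the same way across the pandas readers:

```python
import io
import pandas as pd

# A "file" with an unwanted header row followed by one data row
raw = "old_a,old_b,old_c\n1,2,3\n"

# skiprows=1 drops the unwanted header; header=None plus names
# supplies our own column names instead
df = pd.read_csv(io.StringIO(raw), header=None,
                 names=['A', 'B', 'C'], skiprows=1)
print(df)
```

The old header row never appears as data, and the frame comes out with exactly the columns A, B, C.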
I think setting them afterwards is the only way in this case, so if you have for example four columns in your DataFrame:
df.columns = ['W','X','Y','Z']
If you know in advance what the headers in the Excel file are, it's probably better to rename them; this would rename W to A, etc.:
df.rename(columns={'W': 'A', 'X': 'B'})  # etc.
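Both techniques can be seen on a toy DataFrame (four made-up columns):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]])

# Overwrite all four names at once
df.columns = ['W', 'X', 'Y', 'Z']

# Or rename selected columns via an old-name -> new-name mapping
df = df.rename(columns={'W': 'A', 'X': 'B'})
print(df.columns.tolist())
```

Note that rename returns a new DataFrame by default, so the result has to be assigned back (or inplace=True passed).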
As Ram said, this post comes on the top and may be useful to some....
In pandas 0.24.2 (maybe earlier as well), read_excel itself is capable of ignoring the source headers, supplying your own column names, and a few other useful controls:
DID = pd.read_excel(file1, sheet_name=0, header=None, usecols=[0, 1, 6], names=['A', 'ID', 'B'], dtype={2:str}, skiprows=10)
# for example....
# usecols => read only specific col indexes
# dtype => specifying the data types
# skiprows => skip number of rows from the top.
Call .parse with the header=None keyword argument:
df = xl.parse("Sheet1", header=None)
In case the Excel sheet only contains the data, without headers:
df = pd.read_excel("the excel file", header=None, names=["A", "B", "C"])
In case the Excel sheet already contains header names, use skiprows to skip the header line:
df = pd.read_excel("the excel file", header=None, names=["A", "B", "C"], skiprows=1)