How to create a new CSV from a CSV with each cell on a separate line - Python

I created a function to convert a CSV file.
The goal is to take a CSV file like:
,features,corr_dropped,var_dropped,uv_dropped
0,AghEnt,False,False,False
and I want to convert it to another CSV file:
features
corr_dropped
var_dropped
uv_dropped
0
AghEnt
False
False
False
I created a function for that, but it is not working. The output is the same as the input file.
Function:
import os
import pandas as pd

def convert_file():
    input_file = "../input.csv"
    output_file = os.path.splitext(input_file)[0] + "_converted.csv"
    df = pd.read_table(input_file, sep=',')
    df.to_csv(output_file, index=False, header=True, sep=',')

You could use
df = pd.read_csv(input_file)
This works with your data, though it makes little difference. The only thing that changes is that the empty header cell before the first delimiter now contains Unnamed: 0.
Is that what you wanted? I am still not entirely sure what you are trying to achieve, since you import a CSV and export the same data as a CSV without doing anything to it. The output example you showed is just a reformatted version of your initial data, and formatting is not something a CSV file can store.
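If the goal really is one cell per line, as in the example output of the question, a minimal sketch with the standard csv module could look like the following. The function name flatten_cells and the in-memory sample are assumptions for illustration; empty cells (such as the blank header above the index column) are skipped because the example output omits them.

```python
import csv
import io


def flatten_cells(csv_text):
    """Return every non-empty cell of a CSV, in reading order."""
    cells = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip empty cells such as the blank header above the index column.
        cells.extend(cell for cell in row if cell != "")
    return cells


sample = (
    ",features,corr_dropped,var_dropped,uv_dropped\n"
    "0,AghEnt,False,False,False\n"
)
flat = flatten_cells(sample)
# One cell per line, matching the layout requested in the question.
print("\n".join(flat))
```

To work on real files, the same loop could read from open(input_file, newline="") and write each cell followed by a line break to the output file.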

Related

Text file to CSV conversion: how to get rid of split lines in the input file

I am trying to read a text file from a third party in which some lines are randomly split at the 28th column.
Converting it to CSV works, but when I feed the files to Athena, it cannot read them because of the split lines.
Is there a way to find the stray CR and rejoin the line so that it looks like the other lines?
Thanks,
SM
This is the code snippet:
import pandas as pd
add_columns = ["col1", "col2", "col3"...."col59"]
# only one of sep/delimiter may be given; ',' is the delimiter that took effect here
res = pd.read_csv("file_name.txt", names=add_columns, sep=',', encoding="utf-8", skipinitialspace=True)
df = pd.DataFrame(res)
df.to_csv('final_name.csv', index = None)
file_name.txt
99,999,00499013,X701,,,5669,5669,1232,,1,1,,2,,,,0,0,0,,,,,,,,,,,,,,2400,1232,LXA,,<<line is split on column 28>>
2,5669,,,,68,,,1,,,,,,,,,,,,71,
99,999,00499017,X701,,,5669,5669,1160,,1,1,,2,,,,0,0,0,,,,,,,,,,,,,,2400,1160,LXA,,1,5669,,,,,,,1,,,,,,,,,,,,71,
99,999,00499019,X701,,,5669,5669,1284,,1,1,,2,,,,0,0,0,,,,,,,,,,,,,,2400,1284,LXA,,2,5669,,,,66,,,1,,,,,,,,,,,,71,
I have tried str.split, but no luck.
If you are able to convert it successfully to CSV using pandas, you can try to save it as a CSV to feed into Athena.
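If the split lines still cause trouble, one option is to rejoin them before parsing. The sketch below is only an illustration under stated assumptions: every complete record has the same number of comma-separated fields, no field contains a quoted comma, and a split line continues exactly where the previous one stopped. The function name rejoin_split_lines and the parameter expected_fields are hypothetical.

```python
def rejoin_split_lines(lines, expected_fields):
    """Merge physical lines until each record has expected_fields fields."""
    records = []
    buffer = ""
    for line in lines:
        # Drop the trailing line break (CR and/or LF) and append to the pending record.
        buffer += line.rstrip("\r\n")
        # A record is complete once it holds the expected number of fields.
        if buffer.count(",") + 1 >= expected_fields:
            records.append(buffer)
            buffer = ""
    if buffer:
        # Keep a trailing incomplete record rather than silently dropping it.
        records.append(buffer)
    return records


raw = ["a,b,\n", "c,d,e\n", "1,2,3,4,5\n"]
fixed = rejoin_split_lines(raw, expected_fields=5)
print(fixed)
```

For the real file, expected_fields would be the field count of a known good line, and the repaired records could be written to a new text file before calling pd.read_csv on it.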

How to preserve complicated excel header formats when manipulating data using Pandas Python?

I am parsing a large Excel data file into another one, but the headers are very irregular. I tried the skiprows argument of read_excel, and that did not work. I also tried to include the header in
df = pd.read_excel(user_input, header= [1:3], sheet_name = 'PN Projection'), but then I get the error "ValueError: cannot join with no overlapping index names." To get around this, I tried to name the columns by location, and that did not work either.
When I run the code shown below, everything works fine, but past column "U" the header titles become "unnamed1, 2, ...". I understand this is because pandas treats the first row (which is empty) as the header, but how do I fix this? Is there a way to preserve the headers without manually typing in the format for each cell? Any and all help is appreciated, thank you!
small section of the excel file header
the code I am trying to run
#!/usr/bin/env python
import sys
import os
import pandas as pd
#load source excel file
user_input = input("Enter the path of your source excel file (omit 'C:'): ")
#reads the source excel file
df = pd.read_excel(user_input, sheet_name = 'PN Projection')
#Filtering dataframe
#Filters out rows with 'EOL' in column 'item status' and 'xcvr' in 'description'
df = df[~(df['Item Status'] == 'EOL')]
df = df[~(df['Description'].str.contains("XCVR", na=False))]
#Filters in rows with "XC" or "spartan" in 'description' column
df = df[(df['Description'].str.contains("XC", na=False) | df['Description'].str.contains("Spartan", na=False))]
print(df)
#Saving to a new spreadsheet called Filtered Data
df.to_excel('filtered_data.xlsx', sheet_name='filtered_data')
If you do not need the top 2 rows, then:
df = pd.read_excel(user_input, sheet_name='PN Projection', skiprows=range(0, 2))
This has worked for me when handling several strangely formatted files. Let me know if this isn't what you're looking for, or if there are additional issues.
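If the header rows are needed rather than skipped, another approach is to build one name per column from the stacked header rows. The sketch below is plain Python and makes several assumptions: the header rows have already been pulled out of the sheet as lists (for example after reading it with header=None), blank cells are represented as None or empty strings, and a blank cell in an upper row belongs to the merged cell on its left. The function name combine_header_rows and the sample labels are hypothetical.

```python
def combine_header_rows(header_rows):
    """Build one column name per column from several stacked header rows."""
    filled_rows = []
    for depth, row in enumerate(header_rows):
        is_last = depth == len(header_rows) - 1
        filled, last_seen = [], ""
        for cell in row:
            text = "" if cell is None else str(cell).strip()
            if text:
                last_seen = text
            elif not is_last:
                # A merged cell only stores its label in the first column,
                # so carry that label to the right.
                text = last_seen
            filled.append(text)
        filled_rows.append(filled)
    # Join the levels of each column, skipping empty parts.
    return [" ".join(part for part in column if part) for column in zip(*filled_rows)]


top = ["Item", None, "Forecast", None]
bottom = ["Status", "Description", "Q1", "Q2"]
columns = combine_header_rows([top, bottom])
print(columns)
```

The resulting list could then be assigned to df.columns after dropping the header rows from the data. Note that a blank upper cell that is not part of a merge would also inherit the label to its left, so the result is worth checking by eye.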

Issues with the delimiter when trying to read a comma separated file (Python, Pandas & .csv)

The problem:
I am trying to reproduce results from one of Keith Galli's YouTube courses.
import pandas as pd
import os
import csv
input_loc = "./SalesAnalysis/Sales_Data/"
output_loc = "./SalesAnalysis/korbi_output/"
fileList = os.listdir(input_loc)
all_months_data = pd.DataFrame()
The problem probably starts here:
for file in fileList:
    if file.endswith(".csv"):
        df = pd.read_csv(input_loc+file)
        all_months_data = all_months_data.append(df)
all_months_data.to_csv(output_loc+"all_months_data.csv")
all_months_data.head()
This is my output, and I don't want row 1 to be displayed because it contains no data:
The issue seems to be line 3 in one of my CSV files. A3 is empty except for commas:
So I opened the CSV file and deleted cell A3. When I run the code again, I get this:
instead of this:
What do I have to do to remove the cells without values and still display everything correctly?
I did not understand why this weird problem occurred, but I figured out a workaround that cleans the data and saves everything in a new CSV file:
all_months_data_cleaned = all_months_data.copy()
all_months_data_cleaned = all_months_data.dropna()
all_months_data_cleaned.reset_index(drop=True, inplace=True)
all_months_data_cleaned.to_csv(output_loc+"all_months_data_cleaned.csv")
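A line that contains only delimiters is parsed as a row of NaN values, which is why the empty row shows up. The sketch below illustrates this with in-memory data (the sample rows are hypothetical) and uses pd.concat, since DataFrame.append was removed in pandas 2.0. It also uses dropna(how="all"), which only drops rows where every cell is missing, whereas the plain dropna() in the workaround above drops any row with at least one missing value.

```python
import io

import pandas as pd

# Two in-memory "files" standing in for the monthly CSVs (hypothetical data).
january = io.StringIO("Order ID,Product\n1,Cable\n,\n2,Phone\n")
february = io.StringIO("Order ID,Product\n3,Monitor\n")

# Read every file first, then combine them in a single concat call.
frames = [pd.read_csv(source) for source in (january, february)]
combined = pd.concat(frames, ignore_index=True)

# A line holding only delimiters parses as a row of NaN;
# drop rows where all cells are missing and renumber the index.
cleaned = combined.dropna(how="all").reset_index(drop=True)
print(cleaned)
```

With real files, the list comprehension would iterate over the .csv paths in the input folder instead of the StringIO objects.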

How to only output the calculations done in the code into a csv file python?

I am working on code that takes values from a CSV file and multiplies them by some numbers, but when I save and export the results, the values from the imported file are also copied to the new file along with the results. I just want the results in the output file.
df = pd.read_csv('DAQ4.csv')
df['furnace_power'] = df['furnace_voltage']*df['furnace_current']*0.52 #calculating the furnace power
df['heat_pump_power'] = (df['pump_current']*230)*0.62
with open('DAQsol.csv', 'w', newline='') as f:
    thewriter = csv.writer(f)
    df.to_csv('DAQsol.csv')
This is not the full code, but it should be enough to understand. Basically, I just want the heat pump power and the furnace power to appear in the output file, not the pump current and voltage columns from the imported DAQ4 file.
df['furnace_power'] = df['furnace_voltage']*df['furnace_current']*0.52 #calculating the furnace power
df['heat_pump_power'] = (df['pump_current']*230)*0.62
These two lines just add columns to the dataframe that you loaded. All the other columns still exist, unmodified. By calling df.to_csv('DAQsol.csv') you are saving the whole dataframe, including the unwanted columns.
One way to keep these columns out of the output .csv file is to drop them. Note that drop returns a new dataframe, so the result has to be assigned:
df = df.drop(columns=['unwanted_column_1', 'unwanted_column_two'])
Just make an empty dataframe and populate it with your new calculations:
df = pd.read_csv('DAQ4.csv')
df2 = pd.DataFrame()
df2['furnace_power'] = df['furnace_voltage']*df['furnace_current']*0.52 #calculating the furnace power
df2['heat_pump_power'] = (df['pump_current']*230)*0.62
df2.to_csv('DAQsol.csv')
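A third option is to keep a single dataframe and select only the calculated columns when writing. The sketch below uses a small in-memory stand-in for DAQ4.csv (the sample values are made up) and writes to a StringIO buffer so that it runs without any files.

```python
import io

import pandas as pd

# Hypothetical stand-in for DAQ4.csv with the columns used in the question.
source = io.StringIO(
    "furnace_voltage,furnace_current,pump_current\n"
    "10,2,1\n"
    "20,3,2\n"
)
df = pd.read_csv(source)
df["furnace_power"] = df["furnace_voltage"] * df["furnace_current"] * 0.52
df["heat_pump_power"] = df["pump_current"] * 230 * 0.62

# Select only the calculated columns and write them without the row index.
results = df[["furnace_power", "heat_pump_power"]]
output = io.StringIO()
results.to_csv(output, index=False)
print(output.getvalue())
```

The same selection can also be expressed with the columns argument of to_csv, for example df.to_csv('DAQsol.csv', columns=['furnace_power', 'heat_pump_power'], index=False).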
Hi, here is an example that reads the file and saves the results to a different file after some calculations:
import pandas as pd
# read the CSV file as a dataframe (assumes a ';'-delimited file with a header row)
df = pd.read_csv('DAQ4.csv', delimiter=';')
df['furnace_power'] = df['furnace_voltage'] * df['furnace_current'] * 0.52
df['heat_pump_power'] = df['pump_current'] * 230 * 0.62
# save only the calculated columns as a CSV file
df[['furnace_power', 'heat_pump_power']].to_csv('out_file.csv', sep=';', index=False)
I hope it works for you.

Get a non-blank cell recursively from previous columns of a csv using Python

I am new to both Python and Stack Overflow.
I extract a few columns from a CSV file into an interim CSV file and clean up the data to remove the NaN entries. After extracting them, I end up with the two CSV files below.
Main CSV File:
Sort,Parent 1,Parent 2,Parent 3,Parent 4,Parent 5,Name,Parent 6
1,John,,,Ned,,Dave
2,Sam,Mike,,,,Ken
3,,,Pete,,,Steve
4,,Kerry,,Rachel,,Rog
5,,,Laura,Mitchell,,Kim
Extracted CSV:
Name,ParentNum
Dave,Parent 4
Ken,Parent 2
Steve,Parent 3
Rog,Parent 4
Kim,Parent 4
What I am trying to accomplish is to recurse through the main CSV using the name and parent number. But if I write a for loop, it prints empty rows because it looks up every row for the first value. What is the best approach instead of a for loop? I tried a dictionary reader to read the CSV but could not get far. Any help will be appreciated.
CODE:
import xlrd
import csv
import pandas as pd
print('Opening and Reading the msl sheet from the xlsx file')
with xlrd.open_workbook('msl.xlsx') as wb:
    sh = wb.sheet_by_index(2)
    print("The sheet name is :", sh.name)
    with open('msl.csv', 'w', newline="") as f:
        c = csv.writer(f)
        print('Writing to the CSV file')
        for r in range(sh.nrows):
            c.writerow(sh.row_values(r))
df1 = pd.read_csv('msl.csv', index_col='Sort')
with open('dirty-processing.csv', 'w', newline="") as tbl_writer1:
    c2 = csv.writer(tbl_writer1)
    c2.writerow(['Name','Parent'])
    # first_row is not defined in this snippet; it presumably holds the column headers
    for list_item in first_row:
        for item in df1[list_item].unique():
            row_content = [item, list_item]
            c2.writerow(row_content)
Expected Result:
Input Main CSV:
[image not available]
In the above CSV, I would like to grab unique values from each column into a separate file or any other data type. Then also capture the header of the column they are taken from.
Ex:
Negarnaviricota,Phylum
Haploviricotina,Subphylum
...
so on
The next thing I would like to do is get its parent, which is where I am stuck. Also, as you can see, not all columns have data, so I want to get the last non-blank column. Up to this point everything is accomplished using the above code. So the sample output should look like below.
[image not available]
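Finding the last non-blank parent for each name can be sketched with csv.DictReader, using the main CSV from the question as in-memory input. The function name last_parent_per_name is hypothetical, and the sketch assumes that the parent columns are the ones between "Sort" and "Name"; rows without any parent are skipped.

```python
import csv
import io

main_csv = """Sort,Parent 1,Parent 2,Parent 3,Parent 4,Parent 5,Name,Parent 6
1,John,,,Ned,,Dave
2,Sam,Mike,,,,Ken
3,,,Pete,,,Steve
4,,Kerry,,Rachel,,Rog
5,,,Laura,Mitchell,,Kim
"""


def last_parent_per_name(csv_text):
    """Return (name, parent column, parent value) for each row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Assumption: the parent columns are those between "Sort" and "Name".
    name_index = reader.fieldnames.index("Name")
    parent_columns = reader.fieldnames[1:name_index]
    results = []
    for row in reader:
        # Walk the parent columns right to left and stop at the first non-blank cell.
        for column in reversed(parent_columns):
            value = (row.get(column) or "").strip()
            if value:
                results.append((row["Name"], column, value))
                break
    return results


for name, column, parent in last_parent_per_name(main_csv):
    print(name, column, parent)
```

Scanning each row from right to left avoids the nested lookup over every row that produced the empty output in the for loop described above.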
