CSV file: text to columns using pandas (Python)

def Text2Col(df_File):
    for i in range(0,len(df_File)):
        with open(df_File.iloc[i]['Input']) as inf:
            with open(df_File.iloc[i]['Output'], 'w') as outf:
                i=0
                for line in inf:
                    i=i+1
                    if i==2 or i==3:
                        continue
                    outf.write(','.join(line.split(';')))
The code above converts a semicolon-delimited text file into a comma-separated CSV file, skipping the second and third lines.
Because it uses split(), every value ends up as a string, which is a problem for me.
I tried using the map function but couldn't get it to work.
Is there another way to do this?
My input file has 5 columns: the first is a string, the second is an int, and the rest are floats.
I think the last statement needs some modification:
    outf.write(','.join(line.split(';')))
Please let me know if any other input is required.

OK, trying to help here. If this doesn't work, please specify in your question what is missing or what else needs to be done.
Use pandas to read in a csv file:
import pandas as pd
df = pd.read_csv('your_file.csv')
If you have a header on the first row, then use:
import pandas as pd
df = pd.read_csv('your_file.csv', header=0)
If you have a tab delimiter instead of a comma delimiter, then use:
import pandas as pd
df = pd.read_csv('your_file.csv', header=0, sep='\t')

Thank you !
The following code worked:
import pandas as pd
def Text2Col(df_File):
    for i in range(0,len(df_File)):
        df = pd.read_csv(df_File.iloc[i]['Input'],sep=';')
        df = df[df.index != 0]
        df = df[df.index != 1]
        df.to_csv(df_File.iloc[i]['Output'])
File_List="File_List.csv"
df_File=pd.read_csv(File_List)
Text2Col(df_File)
The input files are kept in the same folder, under the names listed in File_List.csv.
The output files are created in the same folder, with the values separated into columns. I dropped rows 0 and 1 for my own use; skip that step or drop other rows depending on your requirements.
In the code above, df_File is a dataframe with two columns: Input holds the input file names and Output holds the output file names.
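A more compact variant of the same idea, as a minimal sketch: read_csv can skip the two unwanted lines itself through skiprows, and index=False keeps the row index out of the output. The function name text_to_columns, the skip_lines parameter, and the sample data are made up for illustration; the in-memory buffers stand in for the real input and output files.

```python
import io

import pandas as pd


def text_to_columns(src, dst, skip_lines=(1, 2)):
    """Read a ';'-delimited file and write it back comma-separated.

    skip_lines holds 0-based line numbers in the file to drop
    (line 0 is the header), so (1, 2) drops the 2nd and 3rd lines.
    """
    df = pd.read_csv(src, sep=";", skiprows=list(skip_lines))
    df.to_csv(dst, index=False)  # index=False: no extra row-number column
    return df


# Hypothetical sample: header, two lines to skip, then the real data.
sample = io.StringIO(
    "name;count;a;b;c\n"
    "units;-;m;m;m\n"
    "note;-;-;-;-\n"
    "x;1;0.5;1.5;2.5\n"
    "y;2;3.0;4.0;5.0\n"
)
out = io.StringIO()
converted = text_to_columns(sample, out)

# pandas infers the column types instead of leaving everything as text.
print(converted.dtypes)
print(out.getvalue())
```

Because the skipped lines never reach the parser, the numeric columns are inferred as int and float rather than as strings.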

Related

How can I clean text in an Excel file with Python?

I have an Excel file with numbers (integers) in some rows of the first column (A) and text in all rows of the second column (B):
I want to clean this text, that is I want to remove tags like < b r > (without spaces). My current approach doesn't seem to work:
file_name = r"F:\Project\comments_all_sorted.xlsx"
import pandas as pd
df = pd.read_excel(file_name, header=None, index_col=None, usecols='B') # specify that there's no header and no column for row labels, use only column B (which includes the text)
clean_df = df.replace('<br>', '')
clean_df.to_excel('output.xlsx')
What this code does (which I don't want it to do) is it adds running numbers in the first column (A), replacing also the few numbers that were already there, and it adds a first row with '1' in second column of this row (cell 1B):
I'm sure there's an easy way to solve my problem and I'm just not trained enough to see it.
Thanks!
Try this:
df['column_name'] = df['column_name'].str.replace(r'<br>', '')
The index in the output file can be turned off with index=False in the df.to_excel call, i.e.:
clean_df.to_excel('output.xlsx', index=False)
Without regex=True, DataFrame.replace only replaces cells whose entire value equals '<br>', so it leaves the tag untouched inside longer text. To replace substrings, use .str.replace on each column. Here I iterate through all columns in case there is more than just the one.
To get rid of the first column with the sequential numbers (that's the index of the dataframe), add the parameter index=False. The number 1 at the top is the column name. To get rid of that, use header=False:
import pandas as pd
file_name = r"F:\Project\comments_all_sorted.xlsx"  # raw string keeps the backslashes literal
df = pd.read_excel(file_name, header=None, index_col=None, usecols='B') # specify that there's no header and no column for row labels, use only column B (which includes the text)
clean_df = df.copy()
for col in clean_df.columns:
    clean_df[col] = df[col].str.replace('<br>', '')
clean_df.to_excel('output.xlsx', index=False, header=False)
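To see why the original df.replace call appeared to do nothing, here is a small sketch on an in-memory dataframe (the column name text and the sample strings are invented for illustration). It compares whole-cell replacement with substring replacement:

```python
import pandas as pd

# Hypothetical sample data standing in for column B of the Excel file.
df = pd.DataFrame({"text": ["hello<br>world", "<br>", "plain"]})

# Default DataFrame.replace: only cells equal to '<br>' are changed.
exact = df.replace("<br>", "")

# regex=True makes DataFrame.replace substitute inside each string.
substring = df.replace("<br>", "", regex=True)

# Per-column alternative; regex=False treats '<br>' as literal text.
per_column = df["text"].str.replace("<br>", "", regex=False)

print(exact["text"].tolist())      # ['hello<br>world', '', 'plain']
print(substring["text"].tolist())  # ['helloworld', '', 'plain']
print(per_column.tolist())         # ['helloworld', '', 'plain']
```

Either substring variant works; the per-column form is safer when some columns hold numbers, since the .str accessor is only available on string data.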

I have to extract all the rows in a .csv corresponding to the rows with 'watermelon' through pandas

I am using this code, but instead of a new file with just the required rows, I'm getting an empty .csv with just the header.
import pandas as pd
df = pd.read_csv("E:/Mac&cheese.csv")
newdf = df[df["fruit"]=="watermelon"+"*"]
newdf.to_csv("E:/Mac&cheese(2).csv",index=False)
I believe the problem is in how you select the rows containing the word "watermelon". Instead of:
newdf = df[df["fruit"]=="watermelon"+"*"]
Try:
newdf = df[df["fruit"].str.contains("watermelon")]
In your example, pandas is literally looking for cells whose value is exactly "watermelon*"; the == comparison does not treat * as a wildcard.
You are missing the underscore in pd.read_csv in the first call. The file location also looks incorrect: it is missing the // in the path.
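The difference between the two filters can be checked on a small in-memory dataframe. This is only a sketch: the sample rows and the qty column are invented, and case=False and na=False are optional extras that were not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data, including a missing value.
df = pd.DataFrame(
    {
        "fruit": ["watermelon", "Watermelon juice", "apple", None],
        "qty": [1, 2, 3, 4],
    }
)

# Equality against the literal text 'watermelon*': nothing matches.
exact = df[df["fruit"] == "watermelon" + "*"]

# Substring match. na=False treats missing cells as non-matches,
# and case=False ignores upper/lower case.
matches = df[df["fruit"].str.contains("watermelon", case=False, na=False)]

print(len(exact))                 # 0
print(matches["fruit"].tolist())  # ['watermelon', 'Watermelon juice']
```

Without na=False, a missing value in the column produces a missing entry in the mask, which cannot be used for boolean indexing.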

Why is row number added as a column in csv file saved by pandas?

I have two lists of strings named input_rems_text and input_text. I save them as a CSV file.
import pandas as pd
df = pd.DataFrame()
df['A']=input_rems_text
df['B']=input_text
df.to_csv('MyLists.csv', sep="\t")
df.shape is (10000, 2).
The problem appears when I read the CSV file with this code:
import csv
with open('MyLists.csv', 'r') as file:
    for line_num, row in enumerate(csv.reader(file, delimiter='\t')):
        print(len(row))
I get 3 as the row length, and when I print the row itself, the row number is present as a separate column at the beginning of the row. What is my mistake? How can I write a CSV file for the two lists with just 2 columns?
Set the index parameter to False in the to_csv call:
df.to_csv('MyLists.csv', sep="\t", index=False)
Documentation
"Row numbers in CSV file" is called "row index". To suppress row index when you save CSV with df.to_csv, specify index=False.
Btw pandas has its own builtin pd.read_csv command for reading, so use it, no need to use base Python csv.reader as you're doing:
df2 = pd.read_csv('MyLists.csv', sep='\t')
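The effect of index=False can be reproduced without touching the disk. In this sketch the two short lists and the helper row_lengths are made up for illustration, and in-memory buffers replace MyLists.csv:

```python
import csv
import io

import pandas as pd

# Hypothetical stand-ins for input_rems_text and input_text.
df = pd.DataFrame({"A": ["a1", "a2"], "B": ["b1", "b2"]})

with_index = io.StringIO()
df.to_csv(with_index, sep="\t")  # default: row index is written

without_index = io.StringIO()
df.to_csv(without_index, sep="\t", index=False)


def row_lengths(text):
    """Return the number of fields in each tab-separated row."""
    return [len(row) for row in csv.reader(io.StringIO(text), delimiter="\t")]


print(row_lengths(with_index.getvalue()))     # [3, 3, 3]
print(row_lengths(without_index.getvalue()))  # [2, 2, 2]
```

The header row also has three fields in the first case because pandas writes an empty name for the index column.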

pandas read csv then remove leading 0 then rewrite CSV

Hey guys, I have a CSV named info.csv:
Number,Name
01,john
02,mike
010,kevin
012,joe
020,rob
I want to read in the CSV from my path using Python pandas, remove the leading 0, and then write it to a new CSV named newinfo.csv. I have not been able to find any answer on Stack Overflow covering this process.
When I import with Pandas it recognizes the first column as an integer and removes the leading zeros.
import pandas as pd
df = pd.read_csv("info.csv")
df.to_csv("newinfo.csv", index=False)
Or you could change the type to integer yourself.
df.Number = df.Number.astype(int)
Do you have any code you can show for this? What have you tried so far?
You can do something like:
with open('your_file.csv', 'rb') as f:
    data = f.read()
And then slice the first element to remove the 0
Cheers!
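For reference, a small sketch of both routes on the sample data from the question, using an in-memory string in place of info.csv. The dtype and lstrip variant is an addition for the case where the column should stay text; it is not taken from the answers above.

```python
import io

import pandas as pd

raw = "Number,Name\n01,john\n02,mike\n010,kevin\n012,joe\n020,rob\n"

# Default parsing: the column is read as integers, so the zeros vanish.
as_int = pd.read_csv(io.StringIO(raw))
print(as_int["Number"].tolist())   # [1, 2, 10, 12, 20]

# Keep the column as text and strip the leading zeros explicitly.
as_text = pd.read_csv(io.StringIO(raw), dtype={"Number": str})
as_text["Number"] = as_text["Number"].str.lstrip("0")
print(as_text["Number"].tolist())  # ['1', '2', '10', '12', '20']

# Write the cleaned data without the row index.
out = io.StringIO()
as_int.to_csv(out, index=False)
```

Note that lstrip("0") would turn a value of "0" into an empty string, so the integer route is simpler when every value is numeric.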

Read the last column from an Excel file with pandas

It's like how to read certain columns from Excel using Pandas - Python but a little bit more complicated.
Say I have an Excel file called "foo.xlsx" and it grows over time - a new column will be appended on the right every month. However, when I read it, I need only the first two and the last columns. I expected usecols parameter can solve this problem so I went df = pd.read_excel("foo.xlsx", usecols=[0, 1, -1]) but it gives me only the first two columns.
My workaround turns out to be:
df = pd.read_excel("foo.xlsx")
df = df[df.columns[[0, 1, -1]]]
But it needs reading the whole file every time. Is there any way that I can get my desired data frame while reading the file? Thanks.
If you really want to do this (see my comment above), you could do this:
xl = pd.ExcelFile(file)
ncols = xl.book.sheets()[0].ncols
df = xl.parse(0, usecols=[0, 1, ncols-1])
This solution won't read the excel file twice.
One idea is get column count and pass to usecols:
from openpyxl import load_workbook
path = "file.xlsx"
wb = load_workbook(path)
sheet = wb.worksheets[0]
column_count = sheet.max_column
print (column_count)
Or read only first row of file:
column_count = len(pd.read_excel(path, nrows=0).columns)
df = pd.read_excel(path, usecols=[0, 1, column_count-1])
print (df)
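The header-first pattern above can be tried without an Excel engine installed. The sketch below applies the same two steps to CSV text with read_csv, which takes nrows and usecols in the same way; the sample columns are invented, and swapping in read_excel with a real workbook path is left as an assumption to verify against your pandas version.

```python
import io

import pandas as pd

# Hypothetical data: two fixed columns plus one column per month.
raw = "id,name,jan,feb,mar\n1,a,10,20,30\n2,b,11,21,31\n"

# Step 1: read only the header row to count the columns.
column_count = len(pd.read_csv(io.StringIO(raw), nrows=0).columns)

# Step 2: read the first two columns and the last one by position.
df = pd.read_csv(io.StringIO(raw), usecols=[0, 1, column_count - 1])

print(column_count)      # 5
print(list(df.columns))  # ['id', 'name', 'mar']
```

The file is still opened twice, but the first pass parses only the header, so the cost of the extra read stays small.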
You can use df.head() and df.tail() to read the first 2 and last line. For example:
df = pd.read_excel("foo.xlsx", sheet_name='ABC')
#print the first 2 column
print(df.head(2))
#print the last column
print(df.tail(1))
EDIT: Oops, the above code reads rows, not columns. Yes, you have to read the file every time. I don't think there's an option to read a partial file.
For reading column maybe you can do something like this
df['Column Name'][index]
