Delete rows in CSV based on specific value - python

I want to delete specific rows in my CSV with Python. The CSV has multiple rows and columns.
import numpy as np
np.df2 = pd.read_csv('C:/Users/.../Data.csv', delimiter='\t')
np.df_2=np.df2[['Colum.X', 'Colum.Y']]
Python should open Data.csv and then delete every (complete) row where the value of Colum.X > 5 or the value of Colum.Y > 20 in Data.csv.

You can accomplish this with Pandas, no need for Numpy. I assume the columns in your csv are actually named 'Colum.X' and 'Colum.Y'.
import pandas as pd
df = pd.read_csv('C:/Users/.../Data.csv', delimiter='\t')
df = df.loc[df['Colum.X'] <= 5] # Take only the rows where Colum.X <= 5
df = df.loc[df['Colum.Y'] <= 20] # Take only the rows where Colum.Y <= 20
df.to_csv('C:/Users/.../Data.csv', index=False) # Export back to csv (with commas)

Not entirely sure what you're doing with np.df2, but the following will work:
import pandas as pd
df = pd.read_csv('C:/Users/.../Data.csv', delimiter='\t')
df2 = df[(df['X'] <= 5) & (df['Y'] <= 20)]
You might have to pass names=['X', 'Y'] to the read_csv call, depending on what your CSV data looks like.
You can then overwrite the original file with:
df2.to_csv('C:/Users/.../Data.csv')
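A third option, closer to the question's "delete rows" phrasing, is to build a mask of the offending rows and drop them by index. A minimal sketch, assuming the columns really are named 'Colum.X' and 'Colum.Y':
import pandas as pd
df = pd.read_csv('C:/Users/.../Data.csv', delimiter='\t')
# mark every row where Colum.X > 5 or Colum.Y > 20, then drop those rows
to_drop = df[(df['Colum.X'] > 5) | (df['Colum.Y'] > 20)].index
df = df.drop(to_drop)
df.to_csv('C:/Users/.../Data.csv', index=False)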

Related

Python : Split 1 Excel File into multiple Excel files by rows

For example, you have one Excel file and it contains 10,000 rows of data. When we import that Excel file in PyCharm or Jupyter Notebook and run it, we get an index range, also known as row labels. My Python code should be able to read those ten thousand row labels and separate/split them into 10 different Excel files with 1,000 rows in each of the 10 separate files.
Another example: if there are 9,999 rows in one sheet/file, then my Python code should put 9,000 rows into 9 files and the other 999 into another file, without any mistakes. (This is the important part of the question.)
I am asking this because there are no unique values in my data that my code could use to split the files with .unique.
You could use Pandas to read your file, chunk it, then re-write it:
import math
import pandas as pd
df = pd.read_excel("/path/to/excels/file.xlsx")
n_partitions = 3
chunk_size = math.ceil(len(df) / n_partitions)  # rows per output file
for i in range(n_partitions):
    sub_df = df.iloc[(i * chunk_size):((i + 1) * chunk_size)]
    sub_df.to_excel(f"/output/path/to/test-{i}.xlsx", sheet_name="a")
EDIT:
Or if you prefer to set the number of lines per xlsx file:
import pandas as pd
df = pd.read_excel("/path/to/excels/file.xlsx")
rows_per_file = 4
n_chunks = len(df) // rows_per_file
stop = 0
for i in range(n_chunks):
    start = i * rows_per_file
    stop = (i + 1) * rows_per_file
    sub_df = df.iloc[start:stop]
    sub_df.to_excel(f"/output/path/to/test-{i}.xlsx", sheet_name="a")
# write any leftover rows to one last file (reusing i here would
# overwrite the previous chunk, hence the n_chunks suffix)
if stop < len(df):
    sub_df = df.iloc[stop:]
    sub_df.to_excel(f"/output/path/to/test-{n_chunks}.xlsx", sheet_name="a")
You'll need openpyxl installed to read/write Excel files.
The following code snippet works fine for me:
import math
import openpyxl  # used by pandas as the Excel engine
import pandas as pd
data = pd.read_excel(r"path_to_excel_file.xlsx")
_row_range = 200  # rows per output file
_block = math.ceil(len(data) / _row_range)  # number of output files
for x in range(_block):
    startRow = x * _row_range
    endRow = (x + 1) * _row_range
    _data = data.iloc[startRow:endRow]
    _data.to_excel(f"file_name_{x}.xlsx", sheet_name="Sheet1", index=False)
This gets the job done as well. It assumes the Excel files should hold 19,000 rows per file; edit that to suit your scenario.
import math
import pandas as pd
data = pd.read_excel(filename)
count = len(data)
rows_per_file = 19000
no_of_files = math.ceil(count / rows_per_file)
start_row = 0
end_row = rows_per_file
for x in range(no_of_files):
    new_data = data.iloc[start_row:end_row]
    new_data.to_excel(f"filename_{x}.xlsx")
    start_row = end_row  # iloc's stop is exclusive, so the next chunk starts at end_row
    end_row = end_row + rows_per_file
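If you'd rather avoid the index arithmetic altogether, here is a minimal alternative sketch (untested against your data; it assumes pandas and openpyxl are installed) that slices the dataframe into fixed-size chunks with a list comprehension:
import pandas as pd
df = pd.read_excel("/path/to/excels/file.xlsx")
rows_per_file = 1000
# half-open iloc slices: the last chunk simply comes out shorter
chunks = [df.iloc[i:i + rows_per_file] for i in range(0, len(df), rows_per_file)]
for n, chunk in enumerate(chunks):
    chunk.to_excel(f"/output/path/to/part-{n}.xlsx", index=False)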

How to split CSV column data into two colums in Python

I have the following code (below) that grabs two CSV files and merges the data into one consolidated CSV file.
I now need to grab specific information from one of the columns and add that information to two new columns.
What I have now is one output.csv file with the following sample data:
ID,Name,Flavor,RAM,Disk,VCPUs
45fc754d-6a9b-4bde-b7ad-be91ae60f582,customer1-test1-dns,m1.medium,4096,40,2
83dbc739-e436-4c9f-a561-c5b40a3a6da5,customer2-test2,m1.tiny,128,1,1
ef68fcf3-f624-416d-a59b-bb8f1aa2a769,customer3-test3-dns-api,m1.medium,4096,40,2
What I need to do is open this CSV file and split the data in the Name column across two columns as follows:
ID,Name,Flavor,RAM,Disk,VCPUs,Customer,Misc
45fc754d-6a9b-4bde-b7ad-be91ae60f582,customer1-test1-dns,m1.medium,4096,40,2,customer1,test1-dns
83dbc739-e436-4c9f-a561-c5b40a3a6da5,customer2-test2,m1.tiny,128,1,1,customer2,test2
ef68fcf3-f624-416d-a59b-bb8f1aa2a769,customer3-test3-dns-api,m1.medium,4096,40,2,customer3,test3-dns-api
Note how the Misc column can hold values joined by one or more -.
How can I accomplish this in Python? Below is the code I have now:
import csv
import os
import pandas as pd
by_name = {}
with open('flavor.csv') as b:
    for row in csv.DictReader(b):
        name = row.pop('Name')
        by_name[name] = row
with open('output.csv', 'w') as c:
    w = csv.DictWriter(c, ['ID', 'Name', 'Flavor', 'RAM', 'Disk', 'VCPUs'])
    w.writeheader()
    with open('instance.csv') as a:
        for row in csv.DictReader(a):
            try:
                match = by_name[row['Flavor']]
            except KeyError:
                continue
            row.update(match)
            w.writerow(row)
Try this:
import pandas as pd
df = pd.read_csv('flavor.csv')
df[['Customer','Misc']] = df.Name.str.split('-', n=1, expand=True)
df
Output:
ID Name Flavor RAM Disk VCPUs Customer Misc
0 45fc754d-6a9b-4bde-b7ad-be91ae60f582 customer1-test1-dns m1.medium 4096 40 2 customer1 test1-dns
1 83dbc739-e436-4c9f-a561-c5b40a3a6da5 customer2-test2 m1.tiny 128 1 1 customer2 test2
2 ef68fcf3-f624-416d-a59b-bb8f1aa2a769 customer3-test3-dns-api m1.medium 4096 40 2 customer3 test3-dns-api
I would recommend switching over to pandas. Here's the official Getting Started documentation.
Let's first read in the csv.
import pandas as pd
df = pd.read_csv('input.csv')
print(df.head(1))
You should get something similar to:
ID Name Flavor RAM Disk VCPUs
0 45fc754d-6a9b-4bde-b7ad-be91ae60f582 customer1-test1-dns m1.medium 4096 40 2
After that, use string manipulation in the Pandas Series:
df[['Customer','Misc']] = df.Name.str.split('-', n=1, expand=True)
Finally, you can save the csv.
df.to_csv('output.csv')
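One note on that final save (an assumption about the desired output, since the sample data has no index column): to_csv writes the dataframe's row index as an extra first column by default, so you may want to suppress it:
df.to_csv('output.csv', index=False)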
This code would be much more elegant and simpler if you used pandas.
import pandas as pd
df = pd.read_csv('flavor.csv')
df[['Customer','Misc']] = df['Name'].str.split(pat='-',n=1,expand=True)
df.to_csv('output.csv',index=False)
Documentation ref
Here is how I do it. The trick is in the split() method:
import pandas as pd
file = pd.read_csv(r"C:\...\yourfile.csv", sep=",")
file['Customer'] = None
file['Misc'] = None
for x in range(len(file)):
    temp = file.Name[x].split('-', maxsplit=1)
    # .loc avoids the chained-assignment pitfall of file['Customer'].iloc[x] = ...
    file.loc[x, 'Customer'] = temp[0]
    file.loc[x, 'Misc'] = temp[1]
file.to_csv(r"C:\...\yourfile_result.csv")
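One caveat with the loop above: if a Name value happens to contain no '-', split() returns a one-element list and temp[1] raises an IndexError. The str.split answers above already cope with this (the Misc cell just comes out empty). A minimal guard for the loop version, reusing the file dataframe loaded above and assuming such rows should leave Misc empty:
for x in range(len(file)):
    temp = file.Name[x].split('-', maxsplit=1)
    file.loc[x, 'Customer'] = temp[0]
    # names without a '-' yield a one-element list; leave Misc empty then
    file.loc[x, 'Misc'] = temp[1] if len(temp) > 1 else None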

write several txt to directory automatically

We have this if/else iteration whose goal is to split a dataframe into several dataframes. The result of this iteration will vary, so we will not know how many dataframes we will get out of a dataframe. We want to save those several dataframes as text (.txt):
txtDf = open('D:/My_directory/df0.txt', 'w')
txtDf.write(df0)
txtDf.close()
txtDf = open('D:/My_directory/df1.txt', 'w')
txtDf.write(df1)
txtDf.close()
txtDf = open('D:/My_directory/df2.txt', 'w')
txtDf.write(df2)
txtDf.close()
And so on ....
But we want to save those several dataframes automatically, so that we don't need to repeat the code above 100 times for 100 split dataframes.
This is an example of our dataframe df:
column_df
237814
1249823
89176812
89634
976234
98634
and we would like to split the dataframe df into several dataframes df0, df1, df2 (note: each column will be its own dataframe, not all in one dataframe):
column_df0 column_df1 column_df2
237814 89176812 976234
1249823 89634 98634
We tried this code:
import copy
import sys
import numpy as np
import pandas as pd
df = pd.DataFrame(df)
len(df)
if len(df) > 10:
    print('EXCEEEEEEEEEEEEEEEEEEEEDDD!!!')
    sys.exit()
elif len(df) > 2:
    df_dict = {}
    x = 0
    y = 2
    for df_letter in ['A', 'B', 'C', 'D', 'E', 'F']:
        df_name = f'df_{df_letter}'
        df_dict[df_name] = copy.deepcopy(df_letter)
        df_dict[df_name] = pd.DataFrame(df[x:y]).to_string(header=False, index=False, index_names=False).split('\n ')
        df_dict[df_name] = [','.join(ele.split()) for ele in df_dict[df_name]]
        x += 2
        y += 2
    df_name
else:
    df
for df_ in df_dict:
    print(df_)
    print(f'length: {len(df_dict[df_])}')
    txtDf = open('D:/My_directory/{df_dict[df_]}.txt', 'w')
    txtDf.write(df)
    txtDf.close()
The problem with this code is that we cannot write the several .txt files automatically; everything else works just fine. Can anybody figure it out?
If it is a list, then you can iterate through the dict and save each element as a string:
for key, value in df_dict.items():
    with open(f'D:/My_directory/{key}.txt', "w") as file:
        file.write('\n'.join(str(v) for v in value))
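If the intermediate to_string/join juggling isn't actually needed, a minimal alternative sketch (the chunk size of 2 and the df0.txt, df1.txt, ... file names are taken from the example above) slices the dataframe and writes each chunk straight to its own .txt file:
import pandas as pd
df = pd.DataFrame({'column_df': [237814, 1249823, 89176812, 89634, 976234, 98634]})
for i, start in enumerate(range(0, len(df), 2)):
    chunk = df.iloc[start:start + 2]
    # one file per two-row chunk: df0.txt, df1.txt, df2.txt, ...
    chunk.to_csv(f'D:/My_directory/df{i}.txt', header=False, index=False)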

How to get rid of "changing" rows above headers (length changes every time but headers and data are always the same)

I have the following csv file:
[screenshot of the csv file]
There are about 6-8 rows at the top of the file. I know how to make a new dataframe in Pandas and filter the data:
df = pd.read_csv('payments.csv')
df = df[df["type"] == "Order"]
print(df.groupby('sku').size())
df = df[df["marketplace"] == "amazon.com"]
print(df.groupby('sku').size())
df = df[df["promotional rebates"] > ((df["product sales"] + df["shipping credits"]) * -.25)]
print(df.groupby('sku').size())
df.to_csv("out.csv")
My issue is with the headers. I need to:
1. look for the row that has date/time & another field,
so that I do not have to change my code if the file keeps changing the row count before the headers;
2. make a new DF excluding those rows.
What is the best approach to make sure the code does not break on such changes, as long as the header row exists and has a few matching fields? Open to any suggestions.
Considering a CSV file like this:
random line content
another random line
yet another one
datetime, settelment id, type
dd, dd, dd
You can use the following to compute the header's line number:
# load the first 20 rows of the csv file as a one-column dataframe
# (sep="|" keeps each whole line in a single column) to look for the header
df = pd.read_csv("csv_file.csv", sep="|", header=None, nrows=20)
# use a regular expression to check which row contains the header;
# the following generates an array of booleans, with True where a row
# matches the regex "datetime.+settelment id.+type"
indices = df.iloc[:, 0].str.contains("datetime.+settelment id.+type")
# get the row index of the header
header_index = df[indices].index.values[0]
and read the csv file starting from the header's line:
# to read the csv file, use the following (skiprows skips the lines
# above the header, so the header row itself becomes the header):
df = pd.read_csv("csv_file.csv", skiprows=header_index)
Reproducible example:
import pandas as pd
from io import StringIO
st = """random line content
another random line
yet another one
datetime, settelment id, type
dd, dd, dd
"""
df = pd.read_csv(StringIO(st), sep="|", header=None, nrows=20)
indices = df.iloc[:, 0].str.contains("datetime.+settelment id.+type")
header_index = df[indices].index.values[0]
df = pd.read_csv(StringIO(st), skiprows=header_index)
print(df)
print("columns")
print(df.columns)
print("shape")
print(df.shape)
Output:
datetime settelment id type
0 dd dd dd
columns
Index(['datetime', ' settelment id', ' type'], dtype='object')
shape
(1, 3)
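A pandas-free variant of the same idea, as a minimal sketch (it assumes the header line always contains the fields shown above): scan the file line by line for the known fields, then hand the line number to read_csv.
import pandas as pd
# find the physical line number of the header by scanning for known fields
with open("csv_file.csv") as f:
    for header_index, line in enumerate(f):
        if "datetime" in line and "settelment id" in line:
            break
df = pd.read_csv("csv_file.csv", skiprows=header_index)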

Save columns as csv pandas

I'm trying to save specific columns to a csv using pandas. However, there is only one line in the output file. Is there anything wrong with my code? My desired output is to save all columns where d.count() > 1 to a csv file.
import pandas as pd
results = pd.read_csv('employee.csv', sep=';', delimiter=';', low_memory=False)
results['index'] = results.groupby('Name').cumcount()
d = results.pivot(index='index', columns='Name', values='Job')
for columns in d:
    if (d[columns]).count() > 1:
        (d[columns]).dropna(how='any').to_csv('output.csv')
An alternative might be to populate a new dataframe containing what you want to save, and then save that one time.
import pandas as pd
results = pd.read_csv('employee.csv', sep=';', delimiter=';', low_memory=False)
results['index'] = results.groupby('Name').cumcount()
d = results.pivot(index='index', columns='Name', values='Job')
keepcols = []
for columns in d:
    if (d[columns]).count() > 1:
        keepcols.append(columns)
# keepcols holds column labels of the pivoted frame d, so select from d
output_df = d[keepcols]
output_df.to_csv('output.csv')
No doubt you could rationalise the above, and reduce the memory footprint by saving the output directly without first creating an object to hold it, but it helps identify what's going on in the example.
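As a sketch of that rationalisation (same assumptions about employee.csv as above): count() works on the whole pivoted frame at once, so the loop collapses into a single boolean column selection.
import pandas as pd
results = pd.read_csv('employee.csv', sep=';', low_memory=False)
results['index'] = results.groupby('Name').cumcount()
d = results.pivot(index='index', columns='Name', values='Job')
# d.count() counts non-NA cells per column; keep columns with more than one
d.loc[:, d.count() > 1].to_csv('output.csv')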
