The CSV data looks like this:
,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
and here is my program:
data = pd.read_csv('train.csv',delimiter=',')
group = data.drop('quality',axis=1).values
print(group[0])
I want the result to be 7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6, but instead it comes out as 0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8. How do I avoid the index column?
The problem is that the values before the first comma in your data are not being converted to an index, so you need index_col=[0]. Then, after calling .values, the first column is omitted:
data = pd.read_csv('train.csv',delimiter=',', index_col=[0])
Or:
data = pd.read_csv('train.csv', index_col=[0])
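For example, with a trimmed-down version of the sample data (only a few of the original columns, for brevity), the leading index value no longer shows up in .values:

```python
import io
import pandas as pd

# trimmed-down sample of the wine-quality data
raw = """,fixed acidity,volatile acidity,quality
0,7.0,0.27,6
1,6.3,0.3,6
"""

data = pd.read_csv(io.StringIO(raw), index_col=[0])
group = data.drop('quality', axis=1).values
print(group[0])  # only the feature values; the leading 0 is now the index
```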
I have data that looks like this (from jq)
script_runtime{application="app1",runtime="1651394161"} 1651394161
folder_put_time{application="app1",runtime="1651394161"} 22
folder_get_time{application="app1",runtime="1651394161"} 128.544
folder_ls_time{application="app1",runtime="1651394161"} 3.868
folder_ls_count{application="app1",runtime="1651394161"} 5046
The dataframe should allow each row to be manipulated into this:
script_runtime,app1,1651394161,1651394161
folder_put_time,app1,1651394161,22
It's in a text file. How can I easily load it into pandas for data manipulation?
Load the .txt using pd.read_csv(), specifying whitespace as the separator (there is a similar StackOverflow answer). The result will be a two-column dataframe with the bracketed text in the first column and the number in the second column.
df = pd.read_csv("textfile.txt", header=None, delimiter=r"\s+")
Parse the bracketed text into separate columns:
df['function'] = df[0].str.split("{",expand=True)[0]
df['application'] = df[0].str.split("\"",expand=True)[1]
df['runtime'] = df[0].str.split("\"",expand=True)[3]
The result is a dataframe with function, application, and runtime as separate columns.
If you want to drop the first column which contains the bracketed value:
df = df.iloc[: , 1:]
Full code:
df = pd.read_csv("textfile.txt", header=None, delimiter=r"\s+")
df['function'] = df[0].str.split("{",expand=True)[0]
df['application'] = df[0].str.split("\"",expand=True)[1]
df['runtime'] = df[0].str.split("\"",expand=True)[3]
df = df.iloc[: , 1:]
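As a sanity check, the full code can be run end-to-end against the two sample lines (this assumes the file really is whitespace-separated, with no spaces inside the braces):

```python
import io
import pandas as pd

raw = (
    'script_runtime{application="app1",runtime="1651394161"} 1651394161\n'
    'folder_put_time{application="app1",runtime="1651394161"} 22\n'
)

df = pd.read_csv(io.StringIO(raw), header=None, delimiter=r"\s+")
# split off the text before "{" and the quoted values inside the braces
df['function'] = df[0].str.split("{", expand=True)[0]
df['application'] = df[0].str.split('"', expand=True)[1]
df['runtime'] = df[0].str.split('"', expand=True)[3]
df = df.iloc[:, 1:]
print(df)
```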
I have the following df, with the row 0 being the header:
teacher,grade,subject
black,a,english
grayson,b,math
yodd,a,science
What is the best way to use to_csv in python to save each row to a csv so that the files are named:
black.csv
grayson.csv
yodd.csv
Contents of black.csv will be:
teacher,grade,subject
black,a,english
Thanks in advance!
Updated Code:
df8['CaseNumber'] = df8['CaseNumber'].map(str)
df8.set_index('CaseNumber', inplace=True)
for Casenumber, data in df8.iterrows():
    data.to_csv('c:\\users\\admin\\' + Casenumber + '.csv')
This can be done simply by using pandas:
import pandas as pd
# Preempt the issue of columns being read as numeric by marking dtype=str
df = pd.read_csv('your_data.csv', header=0, dtype=str)
df.set_index('teacher', inplace=True)
for teacher, data in df.iterrows():
    data.to_csv(teacher + '.csv')
Edits:
df8.set_index('CaseNumber', inplace=True)
for Casenumber, data in df8.iterrows():
    # Use raw f-strings to make your life easier:
    data.to_csv(rf'c:\users\admin\{Casenumber}.csv')
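One caveat: data here is a Series, and Series.to_csv writes one value per line rather than the header-plus-row layout asked for. To get files shaped exactly like the black.csv example above, one sketch (using the sample data from the question) is to transpose each row back into a one-row frame before saving:

```python
import pandas as pd

df = pd.DataFrame({
    'teacher': ['black', 'grayson', 'yodd'],
    'grade': ['a', 'b', 'a'],
    'subject': ['english', 'math', 'science'],
})

for teacher, row in df.set_index('teacher').iterrows():
    # to_frame().T turns the row Series back into a one-row frame,
    # so each file keeps the teacher,grade,subject header line
    row.to_frame().T.rename_axis('teacher').to_csv(f'{teacher}.csv')
```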
I want to read the csv file and make the date the index column. However, this "international visitor arrivals statistics" line can't be removed!!! How do I remove this annoying header? I have no idea how it got there or how to remove it.
import pandas as pd
import datetime
data5 = pd.read_csv('visitor.csv', parse_dates = [0], index_col=[0])
#data5 = data5.drop([0,1,2], axis = 0) # delete rows with irrelevant data
data5.columns = data5.iloc[3] # set the new header row with the proper header
data5 = data5[4:7768] # Take remaining data less the irrelevant data and the header row
data5
(screenshots of my output and the original Excel file omitted)
Try using the header parameter of pd.read_csv, which sets the row you want to use as the header of your df. In your case that is the 5th row, so set header=4, like this:
data5 = pd.read_csv('visitor.csv', parse_dates = [0], index_col=[0], header=4)
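As a quick check with a made-up file in the same shape (four junk lines above the real header; the column names here are invented):

```python
import io
import pandas as pd

# hypothetical layout: a title plus note lines, then the real header on row 4
raw = """international visitor arrivals statistics
some note line
another note line
yet another note line
date,arrivals
2020-01,100
2020-02,120
"""

data5 = pd.read_csv(io.StringIO(raw), parse_dates=[0], index_col=[0], header=4)
print(data5.columns.tolist())
```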
I have the following csv file (screenshot omitted). There are about 6-8 rows at the top of the file. I know how to make a new dataframe in Pandas and filter the data:
df = pd.read_csv('payments.csv')
df = df[df["type"] == "Order"]
print(df.groupby('sku').size())
df = df[df["marketplace"] == "amazon.com"]
print(df.groupby('sku').size())
df = df[df["promotional rebates"] > ((df["product sales"] + df["shipping credits"])*-.25)]
print(df.groupby('sku').size())
df.to_csv("out.csv")
My issue is with the headers. I need to:
1. look for the row that has date/time and another field, so that I do not have to change my code if the file keeps changing the row count before the headers;
2. make a new df excluding those rows.
What is the best approach to make sure the code does not break as the file changes, as long as the header row exists and has a few matching fields? Open to any suggestions.
Considering a CSV file like this:
random line content
another random line
yet another one
datetime, settelment id, type
dd, dd, dd
You can use the following to compute the header's line number:
# load the first 20 rows of the csv file as a one-column dataframe
# to look for the header; skip_blank_lines=False keeps the row
# numbers aligned with the physical line numbers in the file
df = pd.read_csv("csv_file.csv", sep="|", header=None, nrows=20,
                 skip_blank_lines=False)
# use a regular expression to check which row contains the header;
# the following generates an array of booleans, with True if the row
# matches the regex "datetime.+settelment id.+type" (na=False turns
# blank rows into False instead of NaN)
indices = df.iloc[:,0].str.contains("datetime.+settelment id.+type", na=False)
# get the row index of the header
header_index = df[indices].index.values[0]
and read the csv file starting from the header's index:
# to read the csv file, use the following (skiprows=header_index skips
# everything above the header line, which then becomes the header row):
df = pd.read_csv("csv_file.csv", skiprows=header_index)
Reproducible example:
import pandas as pd
from io import StringIO
st = """
random line content
another random line
yet another one
datetime, settelment id, type
dd, dd, dd
"""
df = pd.read_csv(StringIO(st), sep="|", header=None, nrows=20,
                 skip_blank_lines=False)
indices = df.iloc[:,0].str.contains("datetime.+settelment id.+type", na=False)
header_index = df[indices].index.values[0]
df = pd.read_csv(StringIO(st), skiprows=header_index)
print(df)
print("columns")
print(df.columns)
print("shape")
print(df.shape)
Output:
datetime settelment id type
0 dd dd dd
columns
Index(['datetime', ' settelment id', ' type'], dtype='object')
shape
(1, 3)
I'm trying to save specific columns to a csv using pandas. However, there is only one line in the output file. Is there anything wrong with my code? My desired output is to save all columns where d.count() == 1 to a csv file.
import pandas as pd
results = pd.read_csv('employee.csv', sep=';', low_memory=False)
results['index'] = results.groupby('Name').cumcount()
d = results.pivot(index='index', columns='Name', values='Job')
for columns in d:
    if (d[columns]).count() > 1:
        (d[columns]).dropna(how='any').to_csv('output.csv')
Each call to to_csv('output.csv') inside the loop overwrites the file, which is why only the result of the last write survives. An alternative might be to populate a new dataframe containing what you want to save, and then save that one time.
import pandas as pd
results = pd.read_csv('employee.csv', sep=';', low_memory=False)
results['index'] = results.groupby('Name').cumcount()
d = results.pivot(index='index', columns='Name', values='Job')
keepcols = []
for columns in d:
    if (d[columns]).count() > 1:
        keepcols.append(columns)
output_df = d[keepcols]  # select from the pivoted frame, not the original results
output_df.to_csv('output.csv')
No doubt you could rationalise the above, and reduce the memory footprint by saving the output directly without first creating an object to hold it, but it helps identify what's going on in the example.
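Run against a small made-up employee.csv (names and jobs invented here), only the columns with more than one non-null value survive:

```python
import io
import pandas as pd

# hypothetical employee data: alice appears twice, bob once
raw = "Name;Job\nalice;dev\nalice;qa\nbob;ops\n"
results = pd.read_csv(io.StringIO(raw), sep=';')
results['index'] = results.groupby('Name').cumcount()
d = results.pivot(index='index', columns='Name', values='Job')

keepcols = []
for columns in d:
    if d[columns].count() > 1:
        keepcols.append(columns)

output_df = d[keepcols]
print(output_df.columns.tolist())  # bob has only one entry, so he is dropped
```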