Converting a DataFrame to a CSV file - Python

I am working with a DataFrame named Adult that I have modified and would like to save as a CSV. However, after saving it as a CSV and reloading it to work with again, the data is not converted properly: the headers are not preserved and some columns appear to be combined. I have looked through this page and online, but what I have tried is not working. I load the data in with the following code:
import numpy as np  # import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *

# Read in data from a freely and easily available source on the internet;
# skipinitialspace=True strips the extra spaces in the columns
url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
Adult = pd.read_csv(url2, header=None, skipinitialspace=True)

# Assign reasonable column names to the dataframe
Adult.columns = ["age", "workclass", "fnlwgt", "education", "educationnum", "maritalstatus", "occupation",
                 "relationship", "race", "sex", "capitalgain", "capitalloss", "hoursperweek", "nativecountry",
                 "less50kmoreeq50kn"]
After handling missing values and changing the DataFrame as desired, I have tried:
df = Adult
df.to_csv('file_name.csv', header=True)
df.to_csv('file_name.csv')
and a few other variations. How can I save the file to a CSV and preserve the correct format for the next time I read the file in?
When re-loading the data I use the code:
import pandas as pd
df = pd.read_csv('file_name.csv')
When running df.head the output is:
<bound method NDFrame.head of Unnamed: 0 Unnamed: 0.1 age ... Black Asian-Pac-Islander Other
0 0 0 39 ... 0 0 0
1 1 1 50 ... 0 0 0
2 2 2 38 ... 0 0 0
3 3 3 53 ... 1 0 0
and when running print(df.loc[:, "age"].value_counts()) the output is:
36 898
31 888
34 886
23 877
35 876
which should not have two columns.

If you pickle it like so:
Adult.to_pickle('adult.pickle')
you will subsequently be able to read it back in using read_pickle as follows:
original_adult = pd.read_pickle('adult.pickle')
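As a quick sanity check, you can verify that the round trip was lossless (pickling preserves dtypes, the index, and column order exactly):
print(original_adult.equals(Adult))  # True if nothing was lost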
Hope that helps.

If you want to preserve the output column order, you can specify the columns explicitly while saving the DataFrame:
import pandas as pd
url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
df = pd.read_csv(url2, header=None, skipinitialspace=True)
my_columns = ["age", "workclass", "fnlwgt", "education", "educationnum", "maritalstatus", "occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
df.columns = my_columns
# do the computation ...
df[my_columns].to_csv('file_name.csv')
You can pass index=False, i.e. to_csv('file_name.csv', index=False), if you are not interested in saving the DataFrame's row index. Otherwise, when reading the CSV file back in, you need to specify the index_col parameter.
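Here is a minimal sketch of the two round-trip options (assuming df is your modified DataFrame):
# Option 1: drop the index on save; nothing extra to handle on load
df.to_csv('file_name.csv', index=False)
df2 = pd.read_csv('file_name.csv')
# Option 2: keep the index on save, then restore it on load
df.to_csv('file_name.csv')
df2 = pd.read_csv('file_name.csv', index_col=0)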
According to the documentation, value_counts() returns a Series object. You see two columns because the first one is the index, i.e. the ages (36, 31, ...), and the second is the counts (898, 888, ...).
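You can see this for yourself with a short check (a sketch, assuming the age column from your data):
counts = df['age'].value_counts()
print(type(counts))       # <class 'pandas.core.series.Series'>
print(counts.index[:3])   # the ages - what looks like the first column
print(counts.values[:3])  # the counts - what looks like the second column
If you really want two named columns, counts.reset_index() turns the Series into a DataFrame (the exact column names vary between pandas versions).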

I replicated your code and it works for me; the order of the columns is preserved. Let me show you what I tried. First, this batch of code:
import numpy as np  # import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *

# Read in data from a freely and easily available source on the internet;
# skipinitialspace=True strips the extra spaces in the columns
url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
Adult = pd.read_csv(url2, header=None, skipinitialspace=True)

# Assign reasonable column names to the dataframe
Adult.columns = ["age", "workclass", "fnlwgt", "education", "educationnum", "maritalstatus", "occupation",
                 "relationship", "race", "sex", "capitalgain", "capitalloss", "hoursperweek", "nativecountry",
                 "less50kmoreeq50kn"]
This worked perfectly. Then
df = Adult
This also worked.
Then I saved this data frame to a csv file. Make sure you are providing the absolute path to the file, even if it is being saved in the same folder as this script:
df.to_csv('full_path_to_the_file.csv', header=True)
# so something like:
# df.to_csv('/Users/user_name/Desktop/folder/NameFile.csv', header=True)
Load this csv file into a new_df. It will contain an extra column that keeps track of the index; this is unnecessary and you can drop it as follows:
new_df = pd.read_csv('/Users/user_name/Desktop/folder/NameFile.csv', index_col=None)
new_df = new_df.drop('Unnamed: 0', axis=1)
When I compare the columns of new_df with those of the original df, using this line of code:
new_df.columns == df.columns
I get:
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True])
You might not have been providing the absolute path to the file, or you may have been saving the file twice, as here (you only need to save it once):
df.to_csv('file_name.csv', header=True)
df.to_csv('file_name.csv')

In general, when you save a dataframe, the first column is the index, and you should load the index back in when reading the dataframe. Also, whenever you assign a dataframe to a variable, make sure to copy the dataframe:
df = Adult.copy()
df.to_csv('file_name.csv', header=True)
And to read:
df = pd.read_csv('file_name.csv', index_col=0)
The first column from print(df.loc[:,"age"].value_counts()) is the index column, which is shown whenever you query the dataframe. To save the values to a list, use the to_list method:
print(df.loc[:,"age"].value_counts().to_list())

Reading data from log file in Python

I'm trying to read data from a log file I have in Python. Suppose the file is called data.log.
The content of the file looks as follows:
# Performance log
# time, ff, T vector, dist, windnorth, windeast
0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000
1.00000000,3.02502604,343260.68655952,384.26845401,-7.70828175,-0.45288215
2.00000000,3.01495320,342124.21684440,767.95286901,-7.71506536,-0.45123853
3.00000000,3.00489957,340989.57100678,1151.05303883,-7.72185550,-0.44959182
I would like to obtain the last two columns and put them into two separate lists, such that I get an output like:
list1 = [-7.70828175, -7.71506536, -7.72185550]
list2 = [-0.45288215, -0.45123853, -0.44959182]
I have tried reading the data with the code shown below, but instead of separate columns and rows I just get one whole column with three rows in return.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

file = open('data.log', 'r')
df = pd.read_csv('data.log', sep='\s+')
df = list(df)
print(df[0])
Could someone indicate what I have to adjust in my code to obtain the required output as indicated above?
Thanks in advance!
import pandas as pd

df = pd.read_csv('sample.txt', skiprows=3, header=None,
                 names=['time', 'ff', 'T vector', 'dist', 'windnorth', 'windeast'])

# store a specific column in a list
spam = list(df['windeast'])
print(spam)

df['wind_diff'] = df.windnorth - df['windeast']  # two different ways to access columns
print(df)
print(df['wind_diff'])
Output:
[-0.45288215, -0.45123853, -0.44959182]
time ff T vector dist windnorth windeast wind_diff
0 1.0 3.025026 343260.686560 384.268454 -7.708282 -0.452882 -7.255400
1 2.0 3.014953 342124.216844 767.952869 -7.715065 -0.451239 -7.263827
2 3.0 3.004900 340989.571007 1151.053039 -7.721856 -0.449592 -7.272264
0 -7.255400
1 -7.263827
2 -7.272264
Name: wind_diff, dtype: float64
Note: for creating plots in matplotlib you can work with a pandas.Series directly; there is no need to store it in a list first.
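For example (a minimal sketch, using the df built above):
import matplotlib.pyplot as plt
df['windnorth'].plot(title='windnorth over time')  # plot the Series directly
plt.show()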
The problem comes from the sep parameter. If you remove it, read_csv will use the default (the comma), which is the one you need:
e.g.
>>> import pandas as pd
>>> import numpy as np
>>> file = open('data.log', 'r')
>>> df = pd.read_csv('data.log') # or use sep=','
>>> df = list(df)
>>> df[0]
'1.00000000'
>>> df[5]
'-0.45288215'
Plus, use skiprows to skip the header lines.
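Putting both together, a minimal sketch (assuming data.log looks exactly like the sample in the question):
import pandas as pd

# skiprows=2 drops the two '#' comment lines; the default sep=',' then
# splits the six columns correctly (the answer above used skiprows=3,
# which also drops the all-zero first sample)
df = pd.read_csv('data.log', skiprows=2, header=None,
                 names=['time', 'ff', 'T vector', 'dist', 'windnorth', 'windeast'])
list1 = df['windnorth'].tolist()
list2 = df['windeast'].tolist()
Alternatively, read_csv has a comment='#' parameter that skips comment lines regardless of how many there are.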

How to split CSV column data into two columns in Python

I have the following code (below) that grabs two CSV files and merges their data into one consolidated CSV file.
I now need to grab specific information from one of the columns and add that information to new columns.
What I have now is one output.csv file with the following sample data:
ID,Name,Flavor,RAM,Disk,VCPUs
45fc754d-6a9b-4bde-b7ad-be91ae60f582,customer1-test1-dns,m1.medium,4096,40,2
83dbc739-e436-4c9f-a561-c5b40a3a6da5,customer2-test2,m1.tiny,128,1,1
ef68fcf3-f624-416d-a59b-bb8f1aa2a769,customer3-test3-dns-api,m1.medium,4096,40,2
What I need to do is open this CSV file and split the data in the Name column across two columns, as follows:
ID,Name,Flavor,RAM,Disk,VCPUs,Customer,Misc
45fc754d-6a9b-4bde-b7ad-be91ae60f582,customer1-test1-dns,m1.medium,4096,40,2,customer1,test1-dns
83dbc739-e436-4c9f-a561-c5b40a3a6da5,customer2-test2,m1.tiny,128,1,1,customer2,test2
ef68fcf3-f624-416d-a59b-bb8f1aa2a769,customer3-test3-dns-api,m1.medium,4096,40,2,customer3,test3-dns-api
Note how the Misc column can contain multiple values separated by one or more -.
How can I accomplish this in Python? Below is the code I have now:
import csv
import os
import pandas as pd

by_name = {}
with open('flavor.csv') as b:
    for row in csv.DictReader(b):
        name = row.pop('Name')
        by_name[name] = row

with open('output.csv', 'w') as c:
    w = csv.DictWriter(c, ['ID', 'Name', 'Flavor', 'RAM', 'Disk', 'VCPUs'])
    w.writeheader()
    with open('instance.csv') as a:
        for row in csv.DictReader(a):
            try:
                match = by_name[row['Flavor']]
            except KeyError:
                continue
            row.update(match)
            w.writerow(row)
Try this:
import pandas as pd
df = pd.read_csv('flavor.csv')
df[['Customer','Misc']] = df.Name.str.split('-', n=1, expand=True)
df
Output:
ID Name Flavor RAM Disk VCPUs Customer Misc
0 45fc754d-6a9b-4bde-b7ad-be91ae60f582 customer1-test1-dns m1.medium 4096 40 2 customer1 test1-dns
1 83dbc739-e436-4c9f-a561-c5b40a3a6da5 customer2-test2 m1.tiny 128 1 1 customer2 test2
2 ef68fcf3-f624-416d-a59b-bb8f1aa2a769 customer3-test3-dns-api m1.medium 4096 40 2 customer3 test3-dns-api
I would recommend switching over to pandas. Here's the official Getting Started documentation.
Let's first read in the csv.
import pandas as pd
df = pd.read_csv('input.csv')
print(df.head(1))
You should get something similar to:
ID Name Flavor RAM Disk VCPUs
0 45fc754d-6a9b-4bde-b7ad-be91ae60f582 customer1-test1-dns m1.medium 4096 40 2
After that, use string manipulation on the pandas Series:
df[['Customer','Misc']] = df.Name.str.split('-', n=1, expand=True)
Finally, you can save the csv.
df.to_csv('output.csv')
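If you do not want the row index written out as an extra unnamed column, pass index=False, i.e. df.to_csv('output.csv', index=False).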
This code would be much more elegant and simpler if you used pandas:
import pandas as pd

df = pd.read_csv('flavor.csv')
df[['Customer', 'Misc']] = df['Name'].str.split(pat='-', n=1, expand=True)
df.to_csv('output.csv', index=False)
Documentation ref
Here is how I do it. The trick is in the split() function:
import pandas as pd

file = pd.read_csv(r"C:\...\yourfile.csv", sep=",")
file['Customer'] = None
file['Misc'] = None
for x in range(len(file)):
    temp = file.Name[x].split('-', maxsplit=1)
    file['Customer'].iloc[x] = temp[0]
    file['Misc'].iloc[x] = temp[1]
file.to_csv(r"C:\...\yourfile_result.csv")

How to get rid of "chaning" rows above headers (lenght changes everytime but headers and data are always the same)

I have a csv file (linked in the original post) with about 6-8 rows at the top before the headers. I know how to make a new dataframe in Pandas and filter the data:
df = pd.read_csv('payments.csv')
df = df[df["type"] == "Order"]
print(df.groupby('sku').size())
df = df[df["marketplace"] == "amazon.com"]
print(df.groupby('sku').size())
df = df[df["promotional rebates"] > ((df["product sales"] + df["shipping credits"]) * -.25)]
print(df.groupby('sku').size())
df.to_csv("out.csv")
My issue is with the headers. I need to:
1. look for the row that has date/time and another field, so that I do not have to change my code if the file keeps changing the row count before the headers;
2. make a new DataFrame excluding those rows.
What is the best approach to make sure the code does not break as the file changes, as long as the header row exists and has a few matching fields? Open to any suggestions.
Considering a CSV file like this:
random line content
another random line
yet another one
datetime, settlement id, type
dd, dd, dd
You can use the following to compute the header's line number:
#load the first 20 rows of the csv file as a one column dataframe
#to look for the header
df = pd.read_csv("csv_file.csv", sep="|", header=None, nrows=20)
# use a regular expression to look check which column has the header
# the following will generate a array of booleans
# with True if the row contains the regex "datetime.+settelment id.+type"
indices = df.iloc[:,0].str.contains("datetime.+settelment id.+type")
# get the row index of the header
header_index = df[indices].index.values[0]
and read the csv file starting from the header's index:
# to read the csv file, use the following:
df = pd.read_csv("csv_file.csv", skiprows=header_index + 1)
Reproducible example:
import pandas as pd
from io import StringIO

st = """
random line content
another random line
yet another one
datetime, settlement id, type
dd, dd, dd
"""
df = pd.read_csv(StringIO(st), sep="|", header=None, nrows=20)
indices = df.iloc[:, 0].str.contains("datetime.+settlement id.+type")
header_index = df[indices].index.values[0]
df = pd.read_csv(StringIO(st), skiprows=header_index + 1)
print(df)
print("columns")
print(df.columns)
print("shape")
print(df.shape)
Output:
datetime settlement id type
0 dd dd dd
columns
Index(['datetime', ' settlement id', ' type'], dtype='object')
shape
(1, 3)
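Note that skiprows counts raw file lines (including blank ones), while header_index comes from a parse that skips blank lines by default, so the +1 offset can be fragile depending on blank lines in the file. As an alternative (a sketch, assuming the header line starts with "datetime" as in the sample), you can scan the raw file once and let read_csv parse the header itself:
import pandas as pd

with open('csv_file.csv') as f:
    header_row = next(i for i, line in enumerate(f)
                      if line.startswith('datetime'))

# skiprows=header_row skips everything above the header, so the header
# line itself becomes the first row and is parsed as the column names
df = pd.read_csv('csv_file.csv', skiprows=header_row)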

pd.read_csv ignores columns that don't have headers

I have a .csv file that is generated by a third-party program. The data in the file is in the following format:
%m/%d/%Y 49.78 85 6 15
03/01/1984 6.63368 82 7 9.8 34.29056405 2.79984079 2.110346498 0.014652412 2.304545521 0.004732732
03/02/1984 6.53368 68 0 0.2 44.61471002 3.21623666 2.990408898 0.077444779 2.793385466 0.02661873
03/03/1984 4.388344 55 6 0 61.14463457 3.637231063 3.484310818 0.593098236 3.224973641 0.214360796
There are 5 column headers (row 1 in Excel, columns A-E) but 11 columns in total (in row 1 columns F-K are empty; rows 2-N contain float values for columns A-K).
I was not sure how to paste the .csv lines in so they are easily reproducible, sorry for that. An image of the Excel sheet is shown here: Excel sheet to read in
When I use the following code:
FWInds=pd.read_csv("path.csv")
or:
FWInds=pd.read_csv("path.csv", header=None)
the resulting dataframe FWInds does not contain the last 6 columns - it only contains the columns with headers (columns A-E from excel, column A as index values).
FWIDat.shape
Out[48]: (245, 4)
Ultimately the last 6 columns are the only ones I even want to read in.
I also tried:
FWInds = pd.read_csv('path.csv', header=None, index_col=False)
but got the following error:
CParserError: Error tokenizing data. C error: Expected 5 fields in line 2, saw 11
I also tried to ignore the first row since the column titles are unimportant:
FWInds=pd.read_csv('path.csv', header=None, skiprows=0)
but get the same error.
Also no luck with the "usecols" parameter, it doesn't seem to understand that I'm referring to the column numbers (not names), unless I'm doing it wrong:
FWInds=pd.read_csv('path.csv', header=None, usecols=[5,6,7,8,9,10])
Any tips? I'm sure it's an easy fix but I'm very new to python.
There are a couple of parameters that can be passed to pd.read_csv():
import pandas as pd
colnames = list('ABCDEFGHIJK')
df = pd.read_csv('test.csv', sep='\t', names=colnames)
With this, I can actually import your data quite fine (and it is accessible via e.g. df['K'] afterwards).
You could do it as shown:
col_name = list('ABCDEFGHIJK')
data = 'path.csv'
pd.read_csv(data, delim_whitespace=True, header=None, names=col_name, usecols=col_name[5:])
To read all the columns from A → K, simply omit the usecols parameter.
Data:
from io import StringIO

data = StringIO('''
%m/%d/%Y,49.78,85,6,15
03/01/1984,6.63368,82,7,9.8,34.29056405,2.79984079,2.110346498,0.014652412,2.304545521,0.004732732
03/02/1984,6.53368,68,0,0.2,44.61471002,3.21623666,2.990408898,0.077444779,2.793385466,0.02661873
03/03/1984,4.388344,55,6,0,61.14463457,3.637231063,3.484310818,0.593098236,3.224973641,0.214360796
''')
col_name = list('ABCDEFGHIJK')
pd.read_csv(data, header=None, names=col_name, usecols=col_name[5:])

pandas - Joining CSV time series into a single dataframe

I'm trying to get 4 CSV files into one dataframe. I've looked around on the web for examples and tried a few, but they all give errors. Finally I think I'm onto something, but it gives unexpected results. Can anybody tell me why this doesn't work?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

n = 24 * 365 * 4
dates = pd.date_range('20120101', periods=n, freq='h')
df = pd.DataFrame(np.random.randn(n, 1), index=dates, columns=list('R'))
#df = pd.DataFrame(index=dates)

paths = ['./LAM DIV/10118218_JAN_LAM_DIV_1.csv',
         './LAM DIV/10118218_JAN-APR_LAM_DIV_1.csv',
         './LAM DIV/10118250_JAN_LAM_DIV_2.csv',
         './LAM DIV/10118250_JAN-APR_LAM_DIV_2.csv']

for i in range(len(paths)):
    data = pd.read_csv(paths[i], index_col=0, header=0, parse_dates=True)
    df.join(data['TempC'])

df.head()
Expected result:
Date Time R 0 1 2 3
Getting this:
Date Time R
You need to save the result of your join:
df = df.join(data['TempC'])
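Note also that each of the four files presumably has a TempC column, so joining them one by one will fail with overlapping column names after the first join. One way around that (a sketch, assuming the paths list from the question) is to rename each Series before joining; numbering them also matches the 0 1 2 3 columns in your expected result:
frames = [pd.read_csv(p, index_col=0, header=0, parse_dates=True)['TempC'].rename(i)
          for i, p in enumerate(paths)]
df = df.join(frames)  # join accepts a list of Series/DataFrames
df.head()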
