I'm trying to read data from a log file I have in Python. Suppose the file is called data.log.
The content of the file looks as follows:
# Performance log
# time, ff, T vector, dist, windnorth, windeast
0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000
1.00000000,3.02502604,343260.68655952,384.26845401,-7.70828175,-0.45288215
2.00000000,3.01495320,342124.21684440,767.95286901,-7.71506536,-0.45123853
3.00000000,3.00489957,340989.57100678,1151.05303883,-7.72185550,-0.44959182
I would like to obtain the last two columns and put them into two separate lists, such that I get an output like:
list1 = [-7.70828175, -7.71506536, -7.72185550]
list2 = [-0.45288215, -0.45123853, -0.44959182]
I have tried reading the data with the code shown below, but instead of separate columns and rows I just get one whole column with three rows in return.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
file = open('data.log', 'r')
df = pd.read_csv('data.log', sep='\\s+')
df = list(df)
print (df[0])
Could someone indicate what I have to adjust in my code to obtain the required output as indicated above?
Thanks in advance!
import pandas as pd

df = pd.read_csv('data.log', skiprows=3, header=None,
                 names=['time', 'ff', 'T vector', 'dist', 'windnorth', 'windeast'])

# store a specific column in a list
spam = list(df['windeast'])
print(spam)

df['wind_diff'] = df.windnorth - df['windeast']  # two different ways to access columns
print(df)
print(df['wind_diff'])
Output:
[-0.45288215, -0.45123853, -0.44959182]
time ff T vector dist windnorth windeast wind_diff
0 1.0 3.025026 343260.686560 384.268454 -7.708282 -0.452882 -7.255400
1 2.0 3.014953 342124.216844 767.952869 -7.715065 -0.451239 -7.263827
2 3.0 3.004900 340989.571007 1151.053039 -7.721856 -0.449592 -7.272264
0 -7.255400
1 -7.263827
2 -7.272264
Name: wind_diff, dtype: float64
Note: for creating a plot in matplotlib you can work with a pandas.Series directly; there is no need to store it in a list first, as the sketch below shows.
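For instance, a minimal sketch using the df built above (the column names are the ones assigned in the read_csv call):

import matplotlib.pyplot as plt

# plot the two wind columns straight from the DataFrame, no intermediate lists
df.plot(x='time', y=['windnorth', 'windeast'])
plt.ylabel('wind component')
plt.show()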
The error comes from the sep argument. If you remove it, read_csv will use the default separator (the comma), which is the one you need:
e.g.
>>> import pandas as pd
>>> import numpy as np
>>> file = open('data.log', 'r')
>>> df = pd.read_csv('data.log') # or use sep=','
>>> df = list(df)
>>> df[0]
'1.00000000'
>>> df[5]
'-0.45288215'
Plus, use skiprows to skip the comment header lines.
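Putting both points together, a minimal sketch (skiprows=3 mirrors the first answer above: it drops the two comment lines plus the all-zero first data row):

import pandas as pd

df = pd.read_csv('data.log', skiprows=3, header=None,
                 names=['time', 'ff', 'T vector', 'dist', 'windnorth', 'windeast'])

list1 = df['windnorth'].tolist()
list2 = df['windeast'].tolist()
print(list2)  # [-0.45288215, -0.45123853, -0.44959182]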
Related
I have a csv file like this:
1.149, 1.328, 1.420, 1.148
and that's my current code:
import pandas as pd
df = pd.read_csv("right.csv")
Pandas organizes the output into columns and rows, but I would like to have an output like this:
1.149,
1.328,
1.420,
1.148
I need it that way because afterwards I want to know how much data is in the CSV file and work with it. But right now it just tells me that I have one row and 4 columns.
Could someone help me please?
From my understanding, there is only one row of data, like the one you showed as an example:
1.149, 1.328, 1.420, 1.148
You can replace the separating white space with a new line \n in the raw text:

with open("right.csv") as fh:
    text = fh.read()

print(text.replace(", ", ",\n"))

This will give you the result you are expecting, as I understand it:
1.149,
1.328,
1.420,
1.148
This sounds like an XY Problem, but if you simply want to know the number of fields, count the commas and newlines!
This might only be approximate, as it will depend on how consistent your input is.

count = 0
with open("path/source.csv") as fh:
    for line in fh:  # iterate over lines
        if not line:
            continue
        count += 1  # each non-empty line contributes one field beyond its commas
        count += line.count(",")
Otherwise, perhaps you are looking to Transpose (Wikipedia) the dataframe
>>> import pandas as pd
>>> df = pd.read_csv("test.csv")
>>> df
Empty DataFrame
Columns: [1.149, 1.328, 1.420, 1.148]
Index: []
>>> df.T
Empty DataFrame
Columns: []
Index: [1.149, 1.328, 1.420, 1.148]
>>> df2 = pd.DataFrame({'a': [1,2], 'b': [3,4]})
>>> df2
a b
0 1 3
1 2 4
>>> df2.T
0 1
a 1 2
b 3 4
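Since your four values are being read in as column headers (hence the Empty DataFrame above), here is a small sketch, assuming right.csv has no header row: pass header=None so the values land in a data row, then transpose or simply count the columns.

import pandas as pd

# assumption: right.csv contains only the single data line shown in the question
df = pd.read_csv("right.csv", header=None)
print(df.T)         # the four values, one per row
print(df.shape[1])  # number of fields in that row: 4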
You could probably do something like this:
import pandas as pd

df = pd.read_csv("right.csv")
for column in df:
    index = str(df[column]).find('1')
    print(str(df[column])[index:index + 5])
I am working with a dataset, Adult, that I have modified and would like to save as a csv. However, after saving it as a csv and re-loading the data to work with again, the data is not converted properly: the headers are not preserved and some columns are now combined. I have looked around online, but what I have tried is not working. I load the data in with the following code:
import numpy as np  ## Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *

url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"  # Reading in data from a freely and easily available source on the internet
Adult = pd.read_csv(url2, header=None, skipinitialspace=True)  # skipinitialspace=True removes the extra spaces in columns

## Assigning reasonable column names to the dataframe
Adult.columns = ["age", "workclass", "fnlwgt", "education", "educationnum", "maritalstatus", "occupation",
                 "relationship", "race", "sex", "capitalgain", "capitalloss", "hoursperweek", "nativecountry",
                 "less50kmoreeq50kn"]
After inserting missing values and changing the data frame as desired I have tried:
df = Adult
df.to_csv('file_name.csv',header = True)
df.to_csv('file_name.csv')
and a few other variations. How can I save the file to a CSV and preserve the correct format for the next time I read the file in?
When re-loading the data I use the code:
import pandas as pd
df = pd.read_csv('file_name.csv')
when running df.head the output is:
<bound method NDFrame.head of Unnamed: 0 Unnamed: 0.1 age ... Black Asian-Pac-Islander Other
0 0 0 39 ... 0 0 0
1 1 1 50 ... 0 0 0
2 2 2 38 ... 0 0 0
3 3 3 53 ... 1 0 0
and print(df.loc[:,"age"].value_counts()) the output is:
36 898
31 888
34 886
23 877
35 876
which should not have two columns.
If you pickle it like so:
Adult.to_pickle('adult.pickle')
You will, subsequently, be able to read it back in using read_pickle as follows:
original_adult = pd.read_pickle('adult.pickle')
Hope that helps.
If you want to preserve the output column order you can specify the columns directly while saving the DataFrame:
import pandas as pd
url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
df = pd.read_csv(url2, header=None, skipinitialspace=True)
my_columns = ["age", "workclass", "fnlwgt", "education", "educationnum", "maritalstatus", "occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
df.columns = my_columns
# do the computation ...
df[my_columns].to_csv('file_name.csv')
You can add the parameter index=False, as in to_csv('file_name.csv', index=False), if you are not interested in saving the DataFrame row index. Otherwise, when reading the csv file again you would need to specify the index_col parameter. Both round trips are sketched below.
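A minimal sketch of the two options, reusing the names from the snippet above:

# option 1: don't write the row index at all
df[my_columns].to_csv('file_name.csv', index=False)
df_back = pd.read_csv('file_name.csv')

# option 2: keep the index and tell read_csv which column holds it
df[my_columns].to_csv('file_name.csv')
df_back = pd.read_csv('file_name.csv', index_col=0)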
According to the documentation, value_counts() returns a Series object: you see two columns because the first one is the index, the ages (36, 31, ...), and the second is the count (898, 888, ...).
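A small sketch of how to pull the two parts apart, using the age column from your output (df here is the re-loaded dataframe from the question):

counts = df['age'].value_counts()  # a Series: index holds the ages, values hold the counts
print(counts.index[:3].tolist())   # e.g. [36, 31, 34]
print(counts.values[:3].tolist())  # e.g. [898, 888, 886]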
I replicated your code and it works for me; the order of the columns is preserved. Let me show what I tried. First, this batch of code:
import numpy as np  ## Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *

url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"  # Reading in data from a freely and easily available source on the internet
Adult = pd.read_csv(url2, header=None, skipinitialspace=True)  # skipinitialspace=True removes the extra spaces in columns

## Assigning reasonable column names to the dataframe
Adult.columns = ["age", "workclass", "fnlwgt", "education", "educationnum", "maritalstatus", "occupation",
                 "relationship", "race", "sex", "capitalgain", "capitalloss", "hoursperweek", "nativecountry",
                 "less50kmoreeq50kn"]
This worked perfectly. Then
df = Adult
This also worked.
Then I saved this data frame to a csv file. Make sure you are providing the absolute path to the file, even if it is being saved in the same folder as this script.
df.to_csv('full_path_to_the_file.csv', header=True)
# so something like
# df.to_csv('/Users/user_name/Desktop/folder/NameFile.csv', header=True)
Load this csv file into a new_df. It will contain an extra column for keeping track of the index; it is unnecessary and you can drop it as follows:
new_df = pd.read_csv('/Users/user_name/Desktop/folder/NameFile.csv', index_col=None)
new_df = new_df.drop('Unnamed: 0', axis=1)
When I compare the columns of new_df with those of the original df, using this line of code
new_df.columns == df.columns
I get
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True])
You might not have been providing the absolute path to the file, or you may have been saving the file twice, as here; you only need to save it once:
df.to_csv('file_name.csv',header = True)
df.to_csv('file_name.csv')
In general, when you save the dataframe, the first column is the index, and you should load the index when reading the dataframe back. Also, whenever you assign a dataframe to a variable, make sure to copy the dataframe:
df = Adult.copy()
df.to_csv('file_name.csv',header = True)
And to read:
df = pd.read_csv('file_name.csv', index_col=0)
The first column from print(df.loc[:,"age"].value_counts()) is the index, which is shown whenever you print the Series. To save the counts to a list, use the to_list method:
print(df.loc[:,"age"].value_counts().to_list())
I have a list of string values that I read from a text document with splitlines(), which yields something like this:
X = ["NAME|Contact|Education","SMITH|12345|Graduate","NITA|11111|Diploma"]
I have tried this:

for i in X:
    textnew = i.split("|")
    data[x] = textnew
I want to make a dataframe out of this
Name Contact Education
SMITH 12345 Graduate
NITA 11111 Diploma
You can read it directly from your file by specifying a sep argument to pd.read_csv.
df = pd.read_csv("/path/to/file", sep='|')
Or if you wish to convert it from the list of strings instead:
data = [row.split('|') for row in X]
headers = data.pop(0) # Pop the first element since it's header
df = pd.DataFrame(data, columns=headers)
You had it almost correct, actually, but don't use data as a dictionary (i.e. assigning by key with data[x] = textnew):
X = ["NAME|Contact|Education","SMITH|12345|Graduate","NITA|11111|Diploma"]
df = []
for i in X:
df.append(i.split("|"))
print(df)
# [['NAME', 'Contact', 'Education'], ['SMITH', '12345', 'Graduate'], ['NITA', '11111', 'Diploma']]
It depends on the further transformations, but pandas might be overkill for this kind of task.
Here is a solution for your problem:

import pandas as pd

X = ["NAME|Contact|Education", "SMITH|12345|Graduate", "NITA|11111|Diploma"]
data = []
for i in X:
    data.append(i.split("|"))

# data.pop(0) removes the header row from data and returns it, so it becomes
# the column names while the remaining rows become the values
df = pd.DataFrame(data, columns=data.pop(0))
In your situation, you can avoid loading the file with readlines and let pandas take care of loading the file.
As mentioned above, the solution is a standard read_csv:
import os
import pandas as pd

path = "/tmp"
filepath = "file.xls"
filename = os.path.join(path, filepath)

df = pd.read_csv(filename, sep='|')
print(df.head())
Another approach (for situations where you have no access to the file, or you have to deal with a list of strings) is to wrap the list of strings as an in-memory text file, then load it normally using pandas:
import pandas as pd
from io import StringIO
X = ["NAME|Contact|Education", "SMITH|12345|Graduate", "NITA|11111|Diploma"]
# Wrap the string list as a file of new line
DATA = StringIO("\n".join(X))
# Load as a pandas dataframe
df = pd.read_csv(DATA, delimiter="|")
This gives the same dataframe as reading the delimited file directly.
I'm trying to get 4 CSV files into one dataframe. I've looked around on the web for examples and tried a few but they all give errors. Finally I think I'm onto something, but it gives unexpected results. Can anybody tell me why this doesn't work?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
n = 24*365*4
dates = pd.date_range('20120101',periods=n,freq='h')
df = pd.DataFrame(np.random.randn(n,1),index=dates,columns=list('R'))
#df = pd.DataFrame(index=dates)
paths = ['./LAM DIV/10118218_JAN_LAM_DIV_1.csv',
'./LAM DIV/10118218_JAN-APR_LAM_DIV_1.csv',
'./LAM DIV/10118250_JAN_LAM_DIV_2.csv',
'./LAM DIV/10118250_JAN-APR_LAM_DIV_2.csv']
for i in range(len(paths)):
    data = pd.read_csv(paths[i], index_col=0, header=0, parse_dates=True)
    df.join(data['TempC'])
df.head()
Expected result:
Date Time R 0 1 2 3
Getting this:
Date Time R
You need to save the result of your join:
df = df.join(data['TempC'])
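For completeness, here is a sketch of the corrected loop (assuming each file really does contain a TempC column; each file's column is renamed so the four joins don't collide):

for i, path in enumerate(paths):
    data = pd.read_csv(path, index_col=0, header=0, parse_dates=True)
    # rename each file's TempC so the joined columns stay distinct,
    # and keep the result of the join
    df = df.join(data['TempC'].rename(f'TempC_{i}'))

df.head()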
If I have a file of 100+ columns, how can I make each column into an array, referenced by the column header, without having to do header1 = [1,2,3], header2 = ['a','b','c'], and so on?
Here is what I have so far, where headers is a list of the header names:
import pandas as pd

data = []
df = pd.read_csv('outtest.csv')
for i in headers:
    data.append(getattr(df, i).values)
I want each element of the array headers to be the variable name of the corresponding data array in data (they are in order). Somehow I want one line that does this, so that on the next line I can say, for example, test = headername1*headername2.
import pandas as pd
If the headers are in the csv file, we can simply use:
df = pd.read_csv('outtest.csv')
If the headers are not present in the csv file:
headers = ['list', 'of', 'headers']
df = pd.read_csv('outtest.csv', header=None, names=headers)
Assuming headername1 and headername2 are constants:
test = df.headername1 * df.headername2
Or
test = df['headername1'] * df['headername2']
Assuming they are variable:
test = df[headername1] * df[headername2]
By default this form of access returns a pd.Series, which is generally interoperable with numpy. You can fetch the values explicitly using .values:
df[headername1].values
But you seem to already know this.
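If you really do want plain numpy arrays keyed by header, here is a one-line sketch (headername1 and headername2 are the placeholder names from your question):

# build a dict mapping each column header to its numpy array
arrays = {col: df[col].values for col in df.columns}

test = arrays['headername1'] * arrays['headername2']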
I think I see what you're going for, so using a StringIO object to simulate a file object as the setup:
import pandas as pd
from io import StringIO

txt = '''foo,bar,baz
1, 2, 3
3, 2, 1'''
fileobj = StringIO(txt)
Here's the approximate code you want:
data = []
df = pd.read_csv(fileobj)
for i in df.columns:
    data.append(df[i])

for i in data:
    print(i)
prints
0 1
1 3
Name: foo
0 2
1 2
Name: bar
0 3
1 1
Name: baz