I have a csv file like this:
1.149, 1.328, 1.420, 1.148
and that's my current code:
import pandas as pd
df = pd.read_csv("right.csv")
Python reads this as columns and rows for output.
But I would like to have such an output:
1.149,
1.328,
1.420,
1.148
I need it that way because afterwards I want to know how much data is in the CSV file and work with it. But right now it just tells me that I have one row and 4 columns.
Could someone help me please?
From my understanding, there is only one row of data, like the one you have shown as an example:
1.149, 1.328, 1.420, 1.148
You can replace each ", " separator with a newline \n. Note that DataFrame.replace swaps whole cell values rather than substrings of the printed text (and here read_csv has already consumed the numbers as column labels), so it is simpler to do the replacement on the raw file contents:
with open("right.csv") as fh:
    text = fh.read()
print(text.replace(", ", ",\n"))
Which will give you the result you are expecting according to my understanding:
1.149,
1.328,
1.420,
1.148
This sounds like an XY Problem, but if you simply want to know the number of fields, count the commas and newlines!
This might only be approximate, as it'll depend on how consistent your input is
count = 0
with open("path/source.csv") as fh:
    for line in fh:  # iterate over lines
        if not line.strip():  # skip blank lines (lines keep their "\n", so plain `not line` never fires)
            continue
        count += 1  # the last field on each line has no trailing comma
        count += line.count(",")
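If you end up in pandas anyway, here is a minimal sketch of the same count (assuming, as in the question, a single comma-separated row with no header line): header=None keeps the values as data, and df.size is rows times columns.

```python
import io

import pandas as pd

# Stand-in for right.csv from the question; in practice pass the file path.
data = io.StringIO("1.149, 1.328, 1.420, 1.148")

# header=None stops read_csv from consuming the only row as column labels.
df = pd.read_csv(data, header=None)

print(df.size)  # -> 4 (rows * columns)
```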
Otherwise, perhaps you are looking to transpose the dataframe
>>> import pandas as pd
>>> df = pd.read_csv("test.csv")
>>> df
Empty DataFrame
Columns: [1.149, 1.328, 1.420, 1.148]
Index: []
>>> df.T
Empty DataFrame
Columns: []
Index: [1.149, 1.328, 1.420, 1.148]
>>> df2 = pd.DataFrame({'a': [1,2], 'b': [3,4]})
>>> df2
a b
0 1 3
1 2 4
>>> df2.T
0 1
a 1 2
b 3 4
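Applied to the one-row file from the first question, a short sketch (assuming right.csv really contains a single comma-separated row): reading with header=None keeps the values in a data row, and .T then turns that row into the single column asked for, so len(df) counts the entries.

```python
import io

import pandas as pd

# Stand-in for right.csv from the question.
data = io.StringIO("1.149, 1.328, 1.420, 1.148")

# header=None keeps the numbers out of the column labels; .T makes them a column.
df = pd.read_csv(data, header=None).T

print(len(df))  # -> 4, the number of values in the file
```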
You could probably do something like this:
import pandas as pd

df = pd.read_csv("right.csv")
for column in df:
    index = str(df[column]).find('1')          # locate the value inside the Series' string repr
    print(str(df[column])[index:index + 5])    # print five characters, e.g. "1.149"
I'm trying to read data from a log file I have in Python. Suppose the file is called data.log.
The content of the file looks as follows:
# Performance log
# time, ff, T vector, dist, windnorth, windeast
0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000
1.00000000,3.02502604,343260.68655952,384.26845401,-7.70828175,-0.45288215
2.00000000,3.01495320,342124.21684440,767.95286901,-7.71506536,-0.45123853
3.00000000,3.00489957,340989.57100678,1151.05303883,-7.72185550,-0.44959182
I would like to obtain the last two columns and put them into two separate lists, such that I get an output like:
list1 = [-7.70828175, -7.71506536, -7.72185550]
list2 = [-0.45288215, -0.45123853, -0.44959182]
I have tried reading the data with the following code as shown below, but instead of separate columns and rows I just get one whole column with three rows in return.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
file = open('data.log', 'r')
df = pd.read_csv('data.log', sep='\\s+')
df = list(df)
print (df[0])
Could someone indicate what I have to adjust in my code to obtain the required output as indicated above?
Thanks in advance!
import pandas as pd
df = pd.read_csv('sample.txt', skiprows=3, header=None,
names=['time', 'ff', 'T vector', 'dist', 'windnorth', 'windeast'])
spam = list(df['windeast'])
print(spam)
# store a specific column in a list
df['wind_diff'] = df.windnorth - df['windeast'] # two different ways to access columns
print(df)
print(df['wind_diff'])
output
[-0.45288215, -0.45123853, -0.44959182]
time ff T vector dist windnorth windeast wind_diff
0 1.0 3.025026 343260.686560 384.268454 -7.708282 -0.452882 -7.255400
1 2.0 3.014953 342124.216844 767.952869 -7.715065 -0.451239 -7.263827
2 3.0 3.004900 340989.571007 1151.053039 -7.721856 -0.449592 -7.272264
0 -7.255400
1 -7.263827
2 -7.272264
Name: wind_diff, dtype: float64
Note: for creating plots in matplotlib you can work with a pandas.Series directly; there is no need to store it in a list first.
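For instance, a sketch of plotting the wind columns directly (column names as in the answer above; the Agg backend is selected only so the example runs without a display):

```python
import io

import matplotlib
matplotlib.use("Agg")  # off-screen backend; lets the sketch run headless
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for data.log with the '#' header lines already removed.
data = io.StringIO(
    "1.0,3.02502604,343260.68655952,384.26845401,-7.70828175,-0.45288215\n"
    "2.0,3.01495320,342124.21684440,767.95286901,-7.71506536,-0.45123853\n"
    "3.0,3.00489957,340989.57100678,1151.05303883,-7.72185550,-0.44959182\n"
)
df = pd.read_csv(data, header=None,
                 names=['time', 'ff', 'T vector', 'dist', 'windnorth', 'windeast'])

# The Series go straight into matplotlib -- no intermediate lists needed.
plt.plot(df['time'], df['windnorth'], label='windnorth')
plt.plot(df['time'], df['windeast'], label='windeast')
plt.legend()
plt.savefig('wind.png')
```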
The error comes from the sep parameter. If you remove it, read_csv uses the default separator (the comma), which is the one you need:
e.g.
>>> import pandas as pd
>>> import numpy as np
>>> file = open('data.log', 'r')
>>> df = pd.read_csv('data.log') # or use sep=','
>>> df = list(df)
>>> df[0]
'1.00000000'
>>> df[5]
'-0.45288215'
Plus, use skiprows to skip the header lines.
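Since the header lines in data.log start with '#', another option (a sketch, assuming that convention holds for the whole file) is read_csv's comment parameter, which drops those lines without having to count them for skiprows:

```python
import io

import pandas as pd

# Stand-in for data.log; values shortened except the wind columns.
data = io.StringIO(
    "# Performance log\n"
    "# time, ff, T vector, dist, windnorth, windeast\n"
    "0.0,3.0,343260.7,384.3,-7.70828175,-0.45288215\n"
    "1.0,3.0,342124.2,768.0,-7.71506536,-0.45123853\n"
)

# comment='#' discards every line starting with '#', so no skiprows is needed.
df = pd.read_csv(data, comment='#', header=None,
                 names=['time', 'ff', 'T vector', 'dist', 'windnorth', 'windeast'])

print(df['windeast'].tolist())  # -> [-0.45288215, -0.45123853]
```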
How do I create a dataframe from a string that looks like this (part of the string)?
,file_05,,\r\nx data,y data\r\n-970.0,-34.12164,\r\n-959.0,-32.37526,\r\n-949.0,-30.360199,\r\n-938.0,-28.74816,\r\n-929.0,-27.53912,\r\n-920.0,-25.92707,\r\n-911.0,-24.31503,\r\n-900.0,-23.64334,\r\n-891.0,-22.29997,
Trying to make a dataframe that looks like this:
In the code below s is the string:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(s)).dropna(axis=1)
df.rename(columns={df.columns[0]: ""}, inplace=True)
By the way, if the string comes from a csv file then it is simpler to read the file directly using pd.read_csv.
Edit: This code will create a multiindex of columns:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(s), header = None).dropna(how="all", axis=1).T
df[0] = df.loc[1, 0]
df = df.set_index([0, 1]).T
Looks like you want a multi-level dataframe from the string. Here's how I would do it.
Step 1: Split the string by '\r\n'. Then split each of the resulting lines by ','.
Step 2: The above step creates a list of lists. Element #0 has 4 items and element #1 has 2 items. The rest have 3 items each and are the actual data.
Step 3: Convert the data into a dictionary from element #2 onwards, using the values in element #1 (namely x data and y data) as keys. To ensure you get key: [list of values], use dict.setdefault(key, []).append(value).
Step 4: Create a normal dataframe from the dictionary, since all the values are now stored as keys and lists in it.
Step 5: Now that you have the dataframe, convert its columns to a MultiIndex.
Putting all this together, the code is:
import pandas as pd
text = ',file_05,,\r\nx data,y data\r\n-970.0,-34.12164,\r\n-959.0,-32.37526,\r\n-949.0,-30.360199,\r\n-938.0,-28.74816,\r\n-929.0,-27.53912,\r\n-920.0,-25.92707,\r\n-911.0,-24.31503,\r\n-900.0,-23.64334,\r\n-891.0,-22.29997,'
line_text = [txt.split(',') for txt in text.split('\r\n')]
dct = {}
for x, y, z in line_text[2:]:
    dct.setdefault(line_text[1][0], []).append(x)
    dct.setdefault(line_text[1][1], []).append(y)
df = pd.DataFrame(dct)
df.columns = pd.MultiIndex.from_tuples([(line_text[0][i],line_text[1][i]) for i in [0,1]])
print (df)
Output of this will be:
          file_05
   x data  y data
0 -970.0 -34.12164
1 -959.0 -32.37526
2 -949.0 -30.360199
3 -938.0 -28.74816
4 -929.0 -27.53912
5 -920.0 -25.92707
6 -911.0 -24.31503
7 -900.0 -23.64334
8 -891.0 -22.29997
You can convert your raw data to a table with Python, then save it to a csv file with the csv package.
from pandas import DataFrame

# s is the raw data
s = ",file_05,,\r\nx data,y data\r\n-970.0,-34.12164,\r\n-959.0,-32.37526,\r\n-949.0,-30.360199,\r\n-938.0,-28.74816,\r\n-929.0,-27.53912,\r\n-920.0,-25.92707,\r\n-911.0,-24.31503,\r\n-900.0,-23.64334,\r\n-891.0,-22.29997,"
# convert raw data to a table
table = [i.split(',') for i in s.split("\r\n")]
table = [i[:2] for i in table]
# table is like
"""
[['', 'file_05'],
['x data', 'y data'],
['-970.0', '-34.12164'],
['-959.0', '-32.37526'],
['-949.0', '-30.360199'],
...
['-891.0', '-22.29997']]
"""
# save to output.csv file
import csv
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(table)
# save to a DataFrame df (DataFrame was already imported above)
df = DataFrame(table[2:], columns=table[1][:2])
print(df)
I have a data frame from pandas and now I want to add column names, but only for the second row. Here is an example of my previous output:
Desired output:
My code:
data_line = open("file1.txt", mode="r")
lines = []
for line in data_line:
    lines.append(line)
for i, line in enumerate(lines):
    # print('{}={}'.format(i + 1, line.strip()))
    file1_header = lines[0]
num_line = 1
Dictionary_File1 = {}
Value_File1 = data_type[0:6]
Value_File1_short = []
i = 1
for element in Value_File1:
    type = element.split(',')
    Value_File1_short.append(type[0] + ", " + type[1] + ", " + type[4])
    i += 1
Dictionary_File1[file1_header] = Value_File1_short
pd_file1 = pd.DataFrame.from_dict(Dictionary_File1)
You should have a look at pandas.read_csv. The header keyword parameter lets you indicate which line in the file to use for the header names.
You could probably do it with something like:
pd.read_csv("file1.txt", header=1)
From my python shell I tested it out with:
>>> from io import StringIO # I use python3
>>> import pandas as pd
>>> data = """Type Type2 Type3
... A B C
... 1 2 3
... red blue green"""
>>> # StringIO below allows us to use "data" as input to read_csv
>>> # "sep" keyword is used to indicate how columns are separated in data
>>> df = pd.read_csv(StringIO(data), header=1, sep=r'\s+')
>>> df
A B C
0 1 2 3
1 red blue green
You can write a row using the csv module before writing your dataframe to the same file. Notice this won't help when reading back to Pandas, which doesn't work with "duplicate headers". You can create MultiIndex columns, but this isn't necessary for your desired output.
import pandas as pd
import csv
from io import StringIO
# input file
x = """A,B,C
1,2,3
red,blue,green"""
# replace StringIO(x) with 'file.txt'
df = pd.read_csv(StringIO(x))
with open('file.txt', 'w', newline='') as fout:
    writer = csv.writer(fout)
    writer.writerow(['Type', 'Type2', 'Type3'])
    df.to_csv(fout, index=False)
# read file to check output is correct
df = pd.read_csv('file.txt')
print(df)
# Type Type2 Type3
# 0 A B C
# 1 1 2 3
# 2 red blue green
So if I understand properly, you have a file "file.txt" containing your data, and a list containing the types of your data.
You want to add the list of types, to the pandas.DataFrame of your data. Correct?
If so, you can read the data from the txt file into a pandas DataFrame using pandas.read_csv(), and then define the column headers using df.columns.
So it would look something like:
df = pd.read_csv("file1.txt", header=None)
df.columns = data_type[0:6]
I hope this helps!
Cheers
I have several pandas DataFrames of the same format, with five columns.
I would like to sum the values of each of these dataframes using df.sum(). This will create a Series for each DataFrame, still with five entries (one per column).
My problem is how to take these Series and create another DataFrame, with one column being the filename and the other columns being the five sums from df.sum().
import pandas as pd
import glob
batch_of_dataframes = glob.glob("*.txt")
newdf = []
for filename in batch_of_dataframes:
    df = pd.read_csv(filename)
    df['filename'] = str(filename)
    df = df.sum()
    newdf.append(df)
newdf = pd.concat(newdf, ignore_index=True)
Unfortunately this approach doesn't work: df['filename'] = str(filename) throws a TypeError, and the new dataframe newdf doesn't come out correctly.
How would one do this correctly?
How do you take a number of pandas.Series objects and create a DataFrame?
Try in this order:
Create an empty list, say list_of_series.
For every file:
    load it into a data frame, then save the sum in a series s
    add an element to s: s['filename'] = your_filename
    append s to list_of_series
Finally, concatenate (and transpose if needed):
final_df = pd.concat(list_of_series, axis=1).T
Code
Preparation:
import numpy as np
import pandas as pd

l_df = [pd.DataFrame(np.random.rand(3, 5), columns=list("ABCDE")) for _ in range(5)]
for i, df in enumerate(l_df):
    df.to_csv(str(i) + '.txt', index=False)
Files *.txt are comma separated and contain headers.
! cat 1.txt
A,B,C,D,E
0.18021800981245173,0.29919271590063656,0.09527248614484807,0.9672038093199938,0.07655003742768962
0.35422759068109766,0.04184770882952815,0.682902924462214,0.9400817219440063,0.8825581077493059
0.3762875793116358,0.4745731412494566,0.6545473610147845,0.7479829630649761,0.15641907539706779
And, indeed, the rest is quite similar to what you did (I append file names to a series, not to data frames; otherwise they get concatenated several times by sum()):
import glob

files = glob.glob('*.txt')
print(files)
['3.txt', '0.txt', '4.txt', '2.txt', '1.txt']
list_of_series = []
for f in files:
    df = pd.read_csv(f)
    s = df.sum()
    s['filename'] = f
    list_of_series.append(s)
final_df = pd.concat(list_of_series, axis=1).T
print(final_df)
A B C D E filename
0 1.0675 2.20957 1.65058 1.80515 2.22058 3.txt
1 0.642805 1.36248 0.0237625 1.87767 1.63317 0.txt
2 1.68678 1.26363 0.835245 2.05305 1.01829 4.txt
3 1.22748 2.09256 0.785089 1.87852 2.05043 2.txt
4 0.910733 0.815614 1.43272 2.65527 1.11553 1.txt
To answer this specific question:

@ThomasTu: How do I go from a list of Series with 'filename' as a column to a dataframe? I think that's the problem---I don't understand this.
It's essentially what you have now, but instead of appending to an empty list you append to an empty dataframe. Note that DataFrame.append returns a new dataframe, which is why newdf is reassigned on each iteration.
import pandas as pd
import glob
batch_of_dataframes = glob.glob("*.txt")
newdf = pd.DataFrame()
for filename in batch_of_dataframes:
    df = pd.read_csv(filename)
    df['filename'] = str(filename)
    df = df.sum()
    newdf = newdf.append(df, ignore_index=True)
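Note that DataFrame.append was deprecated and then removed in pandas 2.0, so on recent versions the loop above fails. A sketch of the same idea with pd.concat (two in-memory stand-in frames replace the files found by glob):

```python
import pandas as pd

# Stand-in frames; in practice each one comes from pd.read_csv(filename).
frames = {
    "a.txt": pd.DataFrame({'A': [1, 2], 'B': [3, 4]}),
    "b.txt": pd.DataFrame({'A': [5, 6], 'B': [7, 8]}),
}

summed_rows = []
for filename, df in frames.items():
    s = df.sum()            # one Series of column sums per file
    s['filename'] = filename
    summed_rows.append(s)

# pd.concat replaces the removed DataFrame.append
newdf = pd.concat(summed_rows, axis=1).T.reset_index(drop=True)
print(newdf)
```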
If I have a file of 100+ columns, how can I make each column into an array, referenced by the column header, without having to do header1 = [1,2,3], header2 = ['a','b','c'], and so on?
Here is what I have so far, where headers is a list of the header names:
import pandas as pd
data = []
df = pd.read_csv('outtest.csv')
for i in headers:
    data.append(getattr(df, i).values)
I want each element of the array headers to be the variable name of the corresponding data array in data (they are in order). Somehow I want one line that does this so that the next line I can say, for example, test = headername1*headername2.
import pandas as pd
If the headers are in the csv file, we can simply use:
df = pd.read_csv('outtest.csv')
If the headers are not present in the csv file:
headers = ['list', 'of', 'headers']
df = pd.read_csv('outtest.csv', header=None, names=headers)
Assuming headername1 and headername2 are constants:
test = df.headername1 * df.headername2
Or
test = df['headername1'] * df['headername2']
Assuming they are variable:
test = df[headername1] * df[headername2]
By default this form of access returns a pd.Series, which is generally interoperable with numpy. You can fetch the values explicitly using .values:
df[headername1].values
But you seem to already know this.
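To get the "one array per header" effect without minting 100+ variable names, the usual sketch is a dictionary keyed by header (the headers here are hypothetical stand-ins for your real column names):

```python
import io

import pandas as pd

# Stand-in for outtest.csv.
csv_text = io.StringIO("foo,bar,baz\n1,2,3\n3,2,1\n")
df = pd.read_csv(csv_text)

# One NumPy array per column, keyed by its header name.
arrays = {name: df[name].values for name in df.columns}

# Elementwise product, like the desired headername1*headername2.
test = arrays['foo'] * arrays['bar']
print(test)  # -> [2 6]
```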
I think I see what you're going for, so using a StringIO object to simulate a file object as the setup:
import pandas as pd
from io import StringIO

txt = '''foo,bar,baz
1, 2, 3
3, 2, 1'''
fileobj = StringIO(txt)
Here's the approximate code you want:
data = []
df = pd.read_csv(fileobj)
for i in df.columns:
    data.append(df[i])
for i in data:
    print(i)
prints
0    1
1    3
Name: foo, dtype: int64
0    2
1    2
Name: bar, dtype: int64
0    3
1    1
Name: baz, dtype: int64