Python: Printing a DataFrame to CSV

I am currently using this code:
import pandas as pd
import numpy as np

AllDays = ['a','b','c','d']
TempDay = pd.DataFrame( np.random.randn(4,2) )
TempDay['Dates'] = AllDays
TempDay.to_csv('H:\MyFile.csv', index = False, header = False)
But when it writes the file, the array comes before the dates and there is a header row. I am seeking to print the dates before the TemperatureArray and no header row.
Edit:
The file currently has the TemperatureArray followed by the Dates: [TemperatureArray, Date].
-0.27724356949570034,-0.3096554106726788,a
-0.10619546908708237,0.07430127684522048,b
-0.07619665345406437,0.8474460146082116,c
0.19668718143436803,-0.8072994364484335,d
I am looking to print: [Date, TemperatureArray]
a,-0.27724356949570034,-0.3096554106726788
b,-0.10619546908708237,0.07430127684522048
c,-0.07619665345406437,0.8474460146082116
d,0.19668718143436803,-0.8072994364484335

The pandas.DataFrame.to_csv method has a keyword argument, header=True, that can be set to False to disable the header row. Using it in conjunction with index=False handles the header part of your issue; reordering the columns is a separate step, covered below. For example:
TempDay.to_csv(r'C:\MyFile.csv', index=False, header=False)
(Using a raw string for the path avoids backslash escape surprises on Windows.)
Here is a full example showing how it disables the header row:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randn(6,4))
>>> df
          0         1         2         3
0  1.295908  1.127376 -0.211655  0.406262
1  0.152243  0.175974 -0.777358 -1.369432
2  1.727280 -0.556463 -0.220311  0.474878
3 -1.163965  1.131644 -1.084495  0.334077
4  0.769649  0.589308  0.900430 -1.378006
5 -2.663476  1.010663 -0.839597 -1.195599
>>> # just assigns sequential letters to the column
>>> df[4] = [chr(i+ord('A')) for i in range(6)]
>>> df
          0         1         2         3  4
0  1.295908  1.127376 -0.211655  0.406262  A
1  0.152243  0.175974 -0.777358 -1.369432  B
2  1.727280 -0.556463 -0.220311  0.474878  C
3 -1.163965  1.131644 -1.084495  0.334077  D
4  0.769649  0.589308  0.900430 -1.378006  E
5 -2.663476  1.010663 -0.839597 -1.195599  F
>>> # here we reindex the headers and return a copy
>>> # using this form of indexing just requires you to provide
>>> # a list with all the columns you desire and in the order desired
>>> df2 = df[[4, 1, 2, 3]]
>>> df2
   4         1         2         3
0  A  1.127376 -0.211655  0.406262
1  B  0.175974 -0.777358 -1.369432
2  C -0.556463 -0.220311  0.474878
3  D  1.131644 -1.084495  0.334077
4  E  0.589308  0.900430 -1.378006
5  F  1.010663 -0.839597 -1.195599
>>> df2.to_csv('a.txt', index=False, header=False)
>>> with open('a.txt') as f:
... print(f.read())
...
A,1.1273756275298716,-0.21165535441591588,0.4062624848191157
B,0.17597366083826546,-0.7773584823122313,-1.3694320591723093
C,-0.556463084618883,-0.22031139982996412,0.4748783498361957
D,1.131643603259825,-1.084494967896866,0.334077296863368
E,0.5893080536600523,0.9004299653290818,-1.3780062860066293
F,1.0106633581546611,-0.839597332636998,-1.1955992812601897
If you need to dynamically adjust the columns, and move the last column to the first, you can do as follows:
# this returns the columns as a list
columns = df.columns.tolist()
# removes the last column, the newest one you added
tofirst_column = columns.pop(-1)
# just move it to the start
new_columns = [tofirst_column] + columns
# then reindex with the new column order
df2 = df[new_columns]
This takes the current column list, moves the last entry to the front, and reindexes the columns without requiring any prior knowledge of their names.
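Putting it together for the original question, here is a minimal end-to-end sketch (the data is random, and the output path is shortened to the working directory rather than the H: drive in the question):

```python
import numpy as np
import pandas as pd

# build a frame shaped like the one in the question
AllDays = ['a', 'b', 'c', 'd']
TempDay = pd.DataFrame(np.random.randn(4, 2))
TempDay['Dates'] = AllDays

# move the last column ('Dates') to the front
columns = TempDay.columns.tolist()
new_columns = [columns[-1]] + columns[:-1]
TempDay = TempDay[new_columns]

# write with no index and no header row
TempDay.to_csv('MyFile.csv', index=False, header=False)
```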

Related

convert the dictionary of tuples and float values to a dataframe

I have data formatted as dict[tuple[str, str], list[float]]
and I want to convert it into a pandas DataFrame.
Example data:
{('A','B'): [-0.008035100996494293,0.008541940711438656]}
I tried some data manipulation using split functions.
Expecting:
import pandas as pd

data = {('A','B'): [-0.008035100996494293,0.008541940711438656], ('C','D'): [-0.008035100996494293,0.008541940711438656]}
title = []
heading = []
num_col1 = []
num_col2 = []
for key, val in data.items():
    title.append(key[0])
    heading.append(key[1])
    num_col1.append(val[0])
    num_col2.append(val[1])
data_ = {'title': title, 'heading': heading, 'num_col1': num_col1, 'num_col2': num_col2}
pd.DataFrame(data_)
Your best bet will be to construct your Index manually. For this we can use pandas.MultiIndex.from_tuples since your dictionary keys are stored as tuples. From there we just need to store the values of the dictionary into the body of a DataFrame.
import pandas as pd
data = {('A','B'): [-0.008035100996494293,0.008541940711438656]}
index = pd.MultiIndex.from_tuples(data.keys(), names=['title', 'heading'])
df = pd.DataFrame(data.values(), index=index).reset_index()
print(df)
  title heading         0         1
0     A       B -0.008035  0.008542
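If you also want the value columns named as in your expected output (num_col1 and num_col2 come from the question, not from pandas), you can rename them after the reset_index:

```python
import pandas as pd

data = {('A','B'): [-0.008035100996494293, 0.008541940711438656]}
index = pd.MultiIndex.from_tuples(data.keys(), names=['title', 'heading'])
df = pd.DataFrame(data.values(), index=index).reset_index()

# the value columns come out as 0 and 1; give them explicit names
df = df.rename(columns={0: 'num_col1', 1: 'num_col2'})
```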
If you want chained operation, you can do:
import pandas as pd
data = {('A','B'): [-0.008035100996494293,0.008541940711438656]}
df = (
    pd.DataFrame.from_dict(data, orient='index')
    .pipe(lambda d:
        d.set_axis(pd.MultiIndex.from_tuples(d.index, names=['title', 'heading']))
    )
    .reset_index()
)
print(df)
  title heading         0         1
0     A       B -0.008035  0.008542
Another possible solution, which works also if the tuples and lists vary in length:
pd.concat([pd.DataFrame.from_records([x for x in d.keys()],
                                     columns=['title', 'h1', 'h2']),
           pd.DataFrame.from_records([x[1] for x in d.items()])], axis=1)
Output:
  title h1    h2         0         1    2
0     A  B  None -0.008035  0.008542  NaN
1     C  B     D -0.010351  1.008542  5.0
Data input:
d = {('A','B'): [-0.008035100996494293,0.008541940711438656],
     ('C','B', 'D'): [-0.01035100996494293,1.008541940711438656, 5]}
You can expand the keys and values as you iterate the dictionary items. Pandas will see 4 values which it will make into a row.
>>> import pandas as pd
>>> data = {('A','B'): [-0.008035100996494293,0.008541940711438656]}
>>> pd.DataFrame(((*d[0], *d[1]) for d in data.items()), columns=("Title", "Heading", "Foo", "Bar"))
  Title Heading       Foo       Bar
0     A       B -0.008035  0.008542

Adding Column to Pandas Dataframes from Python Generator

I'm reading a number of csv files into python using a glob matching and would like to add the filename as a column in each of the dataframes. I'm currently matching on a pattern and then using a generator to read in the files as so:
base_list_of_files = glob.glob(matching_pattern)
loaded_csv_data_frames = (pd.read_csv(csv, encoding= 'latin-1') for csv in base_list_of_files)
for idx, df in enumerate(loaded_csv_data_frames):
    df['file_origin'] = base_list_of_files[idx]
combined_data = pd.concat(loaded_csv_data_frames)
However, I get the error ValueError: No objects to concatenate when I come to do the concatenation. Why does adding the column iteratively break the list of dataframes?
Generators can only be consumed once; when exhausted they raise a StopIteration exception, which is handled automatically by the for loop. If you try to consume them again they will just raise StopIteration immediately, as demonstrated here:
def consume(gen):
    while True:
        try:
            print(next(gen))
        except StopIteration:
            print("Stop iteration")
            break
>>> gen = (i for i in range(2))
>>> consume(gen)
0
1
Stop iteration
>>> consume(gen)
Stop iteration
That's why you get the ValueError when you try to use loaded_csv_data_frames for a second time.
I cannot replicate your example, but here it is something that should be similar enough:
df1 = pd.DataFrame(0, columns=["a", "b"], index=[0, 1])
df2 = pd.DataFrame(1, columns=["a", "b"], index=[0, 1])
loaded_csv_data_frames = iter((df1, df2)) # Pretend that these are read from a csv file
base_list_of_files = iter(("df1.csv", "df2.csv")) # Pretend these file names come from glob
You can add the file of origin as a key when you concatenate. Add names too to give titles to your index levels.
>>> df = pd.concat(loaded_csv_data_frames, keys=base_list_of_files, names=["file_origin", "index"])
>>> df
                     a  b
file_origin index
df1.csv     0        0  0
            1        0  0
df2.csv     0        1  1
            1        1  1
If you want file_origin to be one of your columns, just reset first level of the index.
>>> df.reset_index("file_origin")
      file_origin  a  b
index
0         df1.csv  0  0
1         df1.csv  0  0
0         df2.csv  1  1
1         df2.csv  1  1
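An alternative fix is to avoid consuming the generator twice at all: attach the filename with assign as each frame is created, then concatenate the resulting list. The two in-memory frames below stand in for the pd.read_csv calls:

```python
import pandas as pd

# stand-ins for frames read from disk, keyed by their glob-matched paths
frames = {'df1.csv': pd.DataFrame({'a': [0, 0]}),
          'df2.csv': pd.DataFrame({'a': [1, 1]})}

# with real files this would be:
#   loaded = [pd.read_csv(f, encoding='latin-1').assign(file_origin=f)
#             for f in base_list_of_files]
loaded = [df.assign(file_origin=name) for name, df in frames.items()]
combined_data = pd.concat(loaded, ignore_index=True)
```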

In Python, how to sort a dataframe containing accents?

I use sort_values to sort a dataframe. The dataframe contains UTF-8 characters with accents. Here is an example:
>>> df = pd.DataFrame([['i'],['e'],['a'],['é']])
>>> df.sort_values(by=[0])
   0
2  a
1  e
0  i
3  é
As you can see, the "é" with an accent is at the end instead of being after the "e" without accent.
Note that the real dataframe has several columns!
This is one way. The simplest solution, as suggested by @JonClements:
df = df.iloc[df[0].str.normalize('NFKD').argsort()]
An alternative, long-winded solution, normalization code courtesy of @EdChum:
df = pd.DataFrame([['i'],['e'],['a'],['é']])
df = df.iloc[df[0].str.normalize('NFKD').argsort()]
# remove accents
df[1] = df[0].str.normalize('NFKD')\
             .str.encode('ascii', errors='ignore')\
             .str.decode('utf-8')
# sort by new column, then drop
df = df.sort_values(1, ascending=True)\
       .drop(1, axis=1)
print(df)
   0
2  a
1  e
3  é
0  i
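On pandas 1.1 and later there is a shorter route: sort_values accepts a key callable, so you can sort on the normalized text directly without creating and dropping a helper column:

```python
import pandas as pd

df = pd.DataFrame([['i'], ['e'], ['a'], ['é']])

# NFKD splits 'é' into 'e' plus a combining accent, so it sorts right after 'e'
df_sorted = df.sort_values(by=0, key=lambda s: s.str.normalize('NFKD'))
```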

adding rows to empty dataframe with columns

I am using Pandas and want to add rows to an empty DataFrame with columns already established.
So far my code looks like this...
def addRows(cereals, lines):
    for i in np.arange(1, len(lines)):
        dt = parseLine(lines[i])
        dt = pd.Series(dt)
        print(dt)
        # YOUR CODE GOES HERE (add dt to cereals)
        cereals.append(dt, ignore_index = True)
    return cereals
However, when I run...
cereals = addRows(cereals,lines)
cereals
the dataframe returns with no rows, just the columns. I am not sure what I am doing wrong but I am pretty sure it has something to do with the append method. Anyone have any ideas as to what I am doing wrong?
There are two probable reasons your code is not operating as intended:
cereals.append(dt, ignore_index = True) is not doing what you think it is. You're trying to append a series, not a DataFrame there.
cereals.append(dt, ignore_index = True) does not modify cereals in place, so when you return it, you're returning an unchanged copy. An equivalent function would look like this:
--
>>> def foo(a):
... a + 1
... return a
...
>>> foo(1)
1
I haven't tested this on my machine, but I think your fixed solution would look like this:
def addRows(cereals, lines):
    for i in np.arange(1, len(lines)):
        data = parseLine(lines[i])
        new_df = pd.DataFrame([data], columns=cereals.columns)
        cereals = cereals.append(new_df, ignore_index=True)
    return cereals
By the way, I don't really know where lines is coming from, but right away I would at least simplify it to this:
data = [parseLine(line) for line in lines]
cereals = cereals.append(pd.DataFrame(data, columns=cereals.columns), ignore_index=True)
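One caveat for current pandas: DataFrame.append was deprecated in 1.4 and removed in 2.0, so on recent versions the same idea is written with pd.concat. The parsed rows below are placeholders, since parseLine and lines come from your own code:

```python
import pandas as pd

# empty frame with columns already established, as in the question
cereals = pd.DataFrame(columns=['name', 'rating'])

# placeholder for: data = [parseLine(line) for line in lines]
data = [['Corn Flakes', 45.9], ['Cheerios', 50.8]]

new_rows = pd.DataFrame(data, columns=cereals.columns)
cereals = pd.concat([cereals, new_rows], ignore_index=True)
```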
How to add an extra row to a pandas dataframe
You could also create a new DataFrame and just append that DataFrame to your existing one. E.g.
>>> import pandas as pd
>>> empty_alph = pd.DataFrame(columns=['letter', 'index'])
>>> alph_abc = pd.DataFrame([['a', 0], ['b', 1], ['c', 2]], columns=['letter', 'index'])
>>> empty_alph.append(alph_abc)
  letter  index
0      a    0.0
1      b    1.0
2      c    2.0
As I noted in the link, you can also use the loc method on a DataFrame:
>>> df = empty_alph.append(alph_abc)
>>> df.loc[df.shape[0]] = ['d', 3]  # df.shape[0] just finds the next position in the index
>>> df
  letter  index
0      a    0.0
1      b    1.0
2      c    2.0
3      d    3.0

Using Pandas to export multiple rows of data to a csv

I have a matching algorithm which links students to projects. It's working, but I'm having trouble exporting the data to a csv file: only the last value is exported, when there are 200 values to export.
The exported data also treats each character as a separate value, when I would like the whole of 's' in one column rather than the three numbers which make up 's' split into three columns. I've attached the images below. Any help would be appreciated.
What it looks like
What it should look like
# Imports for Pandas
import pandas as pd
from pandas import DataFrame

SPA()
for m in M:
    s = m['student']
    l = m['lecturer']
    Lecturer[l]['limit'] = Lecturer[l]['limit'] - 1
    id = m['projectid']
    p = Project[id]['title']
    c = Project[id]['sourceid']
    r = str(getRank("Single_Projects1copy.csv", s, c))
    print(s+","+l+","+p+","+c+","+r)
    dataPack = (s+","+l+","+p+","+c+","+r)
    df = pd.DataFrame.from_records([dataPack])
    df.to_csv('try.csv')
You keep overwriting df in the loop, so you only end up with the last bit of data. Either append to the csv inside the loop with df.to_csv('try.csv', mode="a", header=False), or build up one DataFrame and write it once outside the loop, something like:
df = pd.DataFrame()
for m in M:
    s = m['student']
    l = m['lecturer']
    Lecturer[l]['limit'] = Lecturer[l]['limit'] - 1
    id = m['projectid']
    p = Project[id]['title']
    c = Project[id]['sourceid']
    r = str(getRank("Single_Projects1copy.csv", s, c))
    print(s+","+l+","+p+","+c+","+r)
    dataPack = (s+","+l+","+p+","+c+","+r)
    df = df.append(pd.DataFrame.from_records([dataPack]))
df.to_csv('try.csv')  # write all data once outside the loop
A better option would be to open a file and pass that file object to to_csv:
with open('try.csv', 'w') as f:
    for m in M:
        s = m['student']
        l = m['lecturer']
        Lecturer[l]['limit'] = Lecturer[l]['limit'] - 1
        id = m['projectid']
        p = Project[id]['title']
        c = Project[id]['sourceid']
        r = str(getRank("Single_Projects1copy.csv", s, c))
        print(s+","+l+","+p+","+c+","+r)
        dataPack = (s+","+l+","+p+","+c+","+r)
        pd.DataFrame.from_records([dataPack]).to_csv(f, header=False)
You get individual chars because you are using from_records passing a single string dataPack as the value so it iterates over the chars:
In [18]: df = pd.DataFrame.from_records(["foobar,"+"bar"])
In [19]: df
Out[19]:
   0  1  2  3  4  5  6  7  8  9
0  f  o  o  b  a  r  ,  b  a  r
In [20]: df = pd.DataFrame(["foobar,"+"bar"])
In [21]: df
Out[21]:
            0
0  foobar,bar
I think you basically want to leave it as a tuple, dataPack = (s, l, p, c, r), and use pd.DataFrame([dataPack]). You don't really need pandas at all here; the csv lib would do all this for you without needing to create DataFrames.
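For the csv-module route mentioned above, a short sketch; the rows are placeholders for the (s, l, p, c, r) tuples built in your loop over M:

```python
import csv

# placeholders for the (s, l, p, c, r) tuples from the matching loop
rows = [('s1', 'lect1', 'proj1', 'c1', '1'),
        ('s2', 'lect2', 'proj2', 'c2', '2')]

with open('try.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)  # one csv row per tuple, no header, no index column
```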
