I have a matching algorithm that links students to projects. It's working, but I'm having trouble exporting the data to a CSV file: it only exports the last value, when there are 200 values to be exported.
Also, the exported data treats each character as a separate value: instead of getting the whole of 's' in one column, the three numbers that make up 's' are split into three columns. I've attached the images below. Any help would be appreciated.
(image: what it looks like)
(image: what it should look like)
# Imports for pandas
import pandas as pd

SPA()
for m in M:
    s = m['student']
    l = m['lecturer']
    Lecturer[l]['limit'] = Lecturer[l]['limit'] - 1
    id = m['projectid']
    p = Project[id]['title']
    c = Project[id]['sourceid']
    r = str(getRank("Single_Projects1copy.csv", s, c))
    print(s + "," + l + "," + p + "," + c + "," + r)
    dataPack = (s + "," + l + "," + p + "," + c + "," + r)
    df = pd.DataFrame.from_records([dataPack])
    df.to_csv('try.csv')
You keep overwriting the file in the loop, so you only end up with the last row of data. You either need to append to the csv with df.to_csv('try.csv', mode="a", header=False), or build one DataFrame inside the loop and write it once outside, something like:
df = pd.DataFrame()
for m in M:
    s = m['student']
    l = m['lecturer']
    Lecturer[l]['limit'] = Lecturer[l]['limit'] - 1
    id = m['projectid']
    p = Project[id]['title']
    c = Project[id]['sourceid']
    r = str(getRank("Single_Projects1copy.csv", s, c))
    print(s + "," + l + "," + p + "," + c + "," + r)
    dataPack = (s + "," + l + "," + p + "," + c + "," + r)
    # append returns a new DataFrame, so the result has to be assigned back
    df = df.append(pd.DataFrame.from_records([dataPack]))
df.to_csv('try.csv')  # write all data once outside the loop
A better option would be to open a file and pass that file object to to_csv:
with open('try.csv', 'w') as f:
    for m in M:
        s = m['student']
        l = m['lecturer']
        Lecturer[l]['limit'] = Lecturer[l]['limit'] - 1
        id = m['projectid']
        p = Project[id]['title']
        c = Project[id]['sourceid']
        r = str(getRank("Single_Projects1copy.csv", s, c))
        print(s + "," + l + "," + p + "," + c + "," + r)
        dataPack = (s + "," + l + "," + p + "," + c + "," + r)
        pd.DataFrame.from_records([dataPack]).to_csv(f, header=False)
You get individual chars because from_records is being passed a single string, dataPack, as the value, so it iterates over the characters:
In [18]: df = pd.DataFrame.from_records(["foobar," + "bar"])

In [19]: df
Out[19]:
   0  1  2  3  4  5  6  7  8  9
0  f  o  o  b  a  r  ,  b  a  r

In [20]: df = pd.DataFrame(["foobar," + "bar"])

In [21]: df
Out[21]:
            0
0  foobar,bar
I think you basically want to leave dataPack as a tuple of fields, dataPack = (s, l, p, c, r), and build the frame from a list of those tuples, pd.DataFrame([dataPack]). You don't really need pandas at all here; the csv lib would do all of this for you without needing to create DataFrames.
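For reference, a minimal sketch of that csv-module route, reusing the names from the question (M, Lecturer, Project and getRank are assumed to exist as above):

import csv

with open('try.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for m in M:
        s = m['student']
        l = m['lecturer']
        Lecturer[l]['limit'] = Lecturer[l]['limit'] - 1
        p = Project[m['projectid']]['title']
        c = Project[m['projectid']]['sourceid']
        r = str(getRank("Single_Projects1copy.csv", s, c))
        # writerow takes an iterable of fields, so each value gets its own column
        writer.writerow((s, l, p, c, r))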
What I'm trying to do is use pandas to create as many separate data arrays as there are runs in my data set. The approach needs to vary depending on the data file read in, so I want the run number, in the second column, to be used to identify the data and split it into separate data sets.
So I have a data set that looks like:
1.350000035018e-03 1.000000000000e+00 -1.617387196395e-14
2.850000048056e-03 1.000000000000e+00 -2.752685546875e-06
4.350000061095e-03 1.000000000000e+00 -2.062988281250e-06
(couple hundred lines later)
1.350000035018e-03 2.000000000000e+00 -1.617387196395e-14
2.850000048056e-03 2.000000000000e+00 -2.752685546875e-06
4.350000061095e-03 2.000000000000e+00 -2.062988281250e-06
(however many readings later)
1.350000035018e-03 35.000000000000e+00 -1.617387196395e-14
2.850000048056e-03 35.000000000000e+00 -2.752685546875e-06
4.350000061095e-03 35.000000000000e+00 -2.062988281250e-06
I want to process it into:
data1 = some number 1.0 some number
        some number 1.0 some number
data2 = some number 2.0 some number
        some number 2.0 some number
datan = some number n   some number
        some number n   some number
So far my code:
import pandas as pd

f = r'C:~.dat'
# store data using pandas
data = pd.read_csv(f, sep='\t', comment='#', names=['V', 'n', 'I'])
# observe data format
print(data)
          V    n             I
0  0.001350  1.0 -1.617387e-14
1  0.002850  1.0 -2.752686e-06
2  0.004350  1.0 -2.062988e-06
# count the loops for automated graph plotting
num = 1
for i in range(len(data)):
    if i > 0:
        if data['n'][i] > data['n'][i-1]:
            num = num + 1
print('there are ' + str(num) + ' runs')
# separate data based on loop number n
for i in range(num):
    run = data.groupby(data.n)
    data+str(i) = run.get_group(i)   # this line is the problem: not valid Python
    print(data+str(i))
Using the grouping method works, but I can't figure out a way to use the loop number as a variable name. Any help/suggestions would be highly appreciated.
Do you need to explicitly name your dataframes, or can they be part of a list or dict?
For instance, you could do something like this...
import pandas as pd

f = r'C:~.dat'
# store data using pandas
data = pd.read_csv(f, sep='\t', comment='#', names=['V', 'n', 'I'])
data_list = []
# get unique run entries
runs = data["n"].unique()
# save each run's corresponding dataframe into data_list
for run in runs:
    data_sub = data[data["n"] == run]
    data_list.append(data_sub)
# access it by doing something as follows
for idx, run in enumerate(runs):
    print("Working on run {}".format(run))
    df_to_operate_on = data_list[idx]
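If you'd rather key each run by its run number instead of a positional index, a dict built directly from groupby does the same job (a sketch, assuming the same data frame as above):

# one sub-DataFrame per run, keyed by the value of the n column
run_dict = {run: group for run, group in data.groupby("n")}
# e.g. the rows belonging to run 1.0
print(run_dict[1.0])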
I'm not entirely sure I understand what you're trying to achieve. But if you aim to have data like this:
1.350000035018e-03 1 -1.617387196395e-14
2.850000048056e-03 2 -2.752685546875e-06
4.350000061095e-03 3 -2.062988281250e-06
do you really need the n column?
Isn't that just the data.index + 1?
(the index in your example is [0, 1, 2], and you're looking for [1, 2, 3], so you might be able to do something like data.n = [i + 1 for i in data.index])
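For what it's worth, a vectorized version of that last suggestion (assuming the default integer index):

# pandas broadcasts the addition over the whole index at once
data['n'] = data.index + 1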
I recently posted about how to create multiple variables from a CSV file. The code worked, in that the variables were created. However, it is creating a bunch of variables all equal to the first row. I need the code to make one variable for each row in the dataframe.
I need 208000 variables labeled A1:A20800.
The code I currently have:
import pandas

df = pandas.read_csv(file_name)
for i in range(1, 207999):
    for c in df:
        exec("%s = %s" % ('A' + str(i), c))
        i += 1
I have tried adding additional quotation marks around the second %s (it gives a syntax error). I have tried selecting all the rows of the df and using that. I'm not sure why it isn't working! Every time I print a variable to test whether it worked, it prints the same value (i.e. A1 = A2 = A3 = ... = A207999). What I actually want is:
A1 = row 1
A2 = row 2
.
.
.
Thank you in advance for any assistance!
I don't know the pandas internals well, but note that iterating over a DataFrame directly yields the column labels, not the rows, so iterate over the rows and use islice to read just 20800 of them:

from itertools import islice

df = pandas.read_csv(file_name)
# itertuples yields one namedtuple per row; islice stops after 20800 of them
A = list(islice(df.itertuples(index=False), 20800))
# now access rows: A[index]
If you want to create a list containing the values of each row from your DataFrame, you can use the method df.iterrows():
[row[1].to_list() for row in df.iterrows()]
If you still want to create a large number of variables, you can do so in a loop as:
for row in df.iterrows():
    # iterrows yields (index, Series) pairs, so the row values are in row[1]
    list_with_row_values = row[1].to_list()
    # create your variables here...
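If the point of the variables is just to look rows up by a name like 'A1', a plain dict is a safer stand-in for exec-created variables (a sketch, assuming df as above with a default integer index):

# map 'A1', 'A2', ... to the values of row 0, row 1, ...
row_vars = {'A' + str(i + 1): row.to_list() for i, row in df.iterrows()}
print(row_vars['A1'])  # values of the first row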
You are getting the same value for all the variables because you are incrementing i in your inner for loop, so all the Annnn variables are probably set to the last value.
So you want something more like:
In [2]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [42, 42, 42]})

In [3]: df
Out[3]:
   a   b
0  1  42
1  2  42
2  3  42
In [27]: i = 1

In [28]: for c in df.iterrows():
    ...:     exec("%s = %s" % ('A' + str(i), 'c'))
    ...:     i += 1
    ...:

In [29]: A1
Out[29]:
(0L, a     1
b    42
Name: 0, dtype: int64)

In [30]: A1[0]
Out[30]: 0L

In [32]: A1[1]
Out[32]:
a     1
b    42
Name: 0, dtype: int64
I have a string as follows:
2017-11-27T09:59:57.278-06:00,,"[0.2094101093721778, -65.0, -76.0]"
2017-11-27T10:00:17.250-06:00,,"[0.13055123127828835, -62.0, -76.0]"
I would like to have following in my data frame:
09:59:57.278 0.2094101093721778 -65.0 -76.0
10:00:17.250 0.13055123127828835 -62.0 -76.0
I tried to strip the first value as:
a = "2017-11-27T09:59:57.278-06:00,,\"[0.2094101093721778, -65.0, -76.0]\""
b = a.strip("2017-11-27T")
I got the following output:
9:59:57.278-06:00,,"[0.2094101093721778, -65.0, -76.0]"
I actually wanted 09:59:57.278-06:00,,"[0.2094101093721778, -65.0, -76.0]"
Your strip removes any combination of the characters provided from both ends of the string, so it also removed the leading 0 from 09. You might want to do one of the following instead:
a = "2017-11-27T09:59:57.278-06:00,,\"[0.2094101093721778, -65.0, -76.0]\""
b = a.replace("2017-11-27T","")
OR
b = ''.join(a.split("2017-11-27T")[1:])
Output (for both)
'09:59:57.278-06:00,,"[0.2094101093721778, -65.0, -76.0]"'
If you have different dates though (and hardcoding is usually bad practice anyway), you probably want to parse that segment of the string as a datetime object and format it back into the string:

import datetime

t = a.split(",")
# parse without the trailing "-06:00" offset, then keep only the time part
t[0] = datetime.datetime.strftime(
    datetime.datetime.strptime(t[0][0:-6], "%Y-%m-%dT%H:%M:%S.%f"),
    "%H:%M:%S.%f")
b = ','.join(t)  # rejoin with commas, since split removed them
The best way, though, if it's intended for your DataFrame, is probably to just let pandas interpret the date. See this link for more details.
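For illustration, a minimal sketch of that pandas route, assuming the two lines live in CSV text (the column names ts/empty/vals are made up); to_datetime copes with the offset, and the bracketed list would still need its own cleanup:

import io
import pandas as pd

raw = ('2017-11-27T09:59:57.278-06:00,,"[0.2094101093721778, -65.0, -76.0]"\n'
       '2017-11-27T10:00:17.250-06:00,,"[0.13055123127828835, -62.0, -76.0]"')
df = pd.read_csv(io.StringIO(raw), header=None, names=['ts', 'empty', 'vals'])
# parse the full ISO timestamp (offset included), keep only the time-of-day part
df['ts'] = pd.to_datetime(df['ts']).dt.strftime('%H:%M:%S.%f')
print(df['ts'])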
You could try this:

import pandas as pd

lin = '2017-11-27T09:59:57.278-06:00,,"[0.2094101093721778, -65.0, -76.0]"\n 2017-11-27T10:00:17.250-06:00,,"[0.13055123127828835, -62.0, -76.0]"'
chrToReplace = [',,', '[', ']', '"', ',']
y = []
# Iterate through your lines
for x in lin.splitlines():
    for c in chrToReplace:
        if c in x:
            x = x.replace(c, " ")
    x = x.split()
    n = 0
    z = {}
    for elm in x:
        z.update({"V" + str(n): elm})
        n += 1
    y.append(z)
df = pd.DataFrame(y)
print(df)
This gives you
V0 V1 V2 V3
0 2017-11-27T09:59:57.278-06:00 0.2094101093721778 -65.0 -76.0
1 2017-11-27T10:00:17.250-06:00 0.13055123127828835 -62.0 -76.0
Sometimes I end up with a series of tuples/lists when using Pandas. This is common when, for example, doing a group-by and passing a function that has multiple return values:
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(dict(x=np.random.randn(100),
                       y=np.repeat(list("abcd"), 25)))
out = df.groupby("y").x.apply(stats.ttest_1samp, 0)
print(out)
y
a     (1.3066417476, 0.203717485506)
b    (0.0801133382517, 0.936811414675)
c     (1.55784329113, 0.132360504653)
d    (0.267999459642, 0.790989680709)
dtype: object
What is the correct way to "unpack" this structure so that I get a DataFrame with two columns?
A related question is how I can unpack either this structure or the resulting dataframe into two Series/array objects. This almost works:
t, p = zip(*out)
but then t is
(array(1.3066417475999257),
array(0.08011333825171714),
array(1.557843291126335),
array(0.267999459641651))
and one needs to take the extra step of squeezing it.
Maybe this is the most straightforward way (the most pythonic, I guess):

out.apply(pd.Series)

If you want to rename the columns to something more meaningful, assign the result and rename its columns:

res = out.apply(pd.Series)
res.columns = ['Kstats', 'Pvalue']

If you do not want the default name for the index:

res.index.name = None
maybe:
>>> pd.DataFrame(out.tolist(), columns=['out-1', 'out-2'], index=out.index)
                  out-1     out-2
y
a   -1.9153853424536496  0.067433
b     1.277561889173181  0.213624
c  0.062021492729736116  0.951059
d    0.3036745009819999  0.763993

[4 rows x 2 columns]
I believe you want this:
df = pd.DataFrame(out.tolist())
df.columns = ['KS-stat', 'P-value']
result:
KS-stat P-value
0 -2.12978778869 0.043643
1 3.50655433879 0.001813
2 -1.2221274198 0.233527
3 -0.977154419818 0.338240
I have met a similar problem. The two ways I found to solve it are exactly the answers of @CT Zhu and @Siraj S.
Here is some supplementary information you might be interested in:
I compared the two ways and found that @CT Zhu's way performs much faster as the input size grows.
Example:
# Python 3
import time
from statistics import mean

import pandas as pd

df_a = pd.DataFrame({'a': range(1000), 'b': range(1000)})

# function to test
def func1(x):
    c = str(x) * 3
    d = int(x) + 100
    return c, d

# Siraj S's way
time_difference = []
for i in range(100):
    start = time.time()
    df_b = df_a['b'].apply(lambda x: func1(x)).apply(pd.Series)
    end = time.time()
    time_difference.append(end - start)
print(mean(time_difference))
# 0.14907703161239624

# CT Zhu's way
time_difference = []
for i in range(100):
    start = time.time()
    df_b = pd.DataFrame(df_a['b'].apply(lambda x: func1(x)).tolist())
    end = time.time()
    time_difference.append(end - start)
print(mean(time_difference))
# 0.0014058423042297363
PS: Please forgive my ugly code.
Not sure if t and r are predefined somewhere, but if not, I am getting the two tuples passed to t and r by:
>>> t, r = zip(*out)
>>> t
(-1.776982300308175, 0.10543682705459552, -1.7206831272759038, 1.0062163376448068)
>>> r
(0.08824925924534484, 0.9169054844258786, 0.09817788453771065, 0.3243492942246433)
Thus, you could do this:
>>> df = pd.DataFrame(columns=['t', 'r'])
>>> df.t, df.r = zip(*out)
>>> df
t r
0 -1.776982 0.088249
1 0.105437 0.916905
2 -1.720683 0.098178
3 1.006216 0.324349
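You can also build it in one step from a dict, without pre-declaring empty columns (a sketch, assuming out as above):

>>> t, r = zip(*out)
>>> df = pd.DataFrame({'t': t, 'r': r})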
I am currently using this code:
import numpy as np
import pandas as pd

AllDays = ['a', 'b', 'c', 'd']
TempDay = pd.DataFrame(np.random.randn(4, 2))
TempDay['Dates'] = AllDays
TempDay.to_csv('H:\MyFile.csv', index=False, header=False)
But when it writes the file, the array comes before the dates, and there is a header row. I am seeking to write the dates before the TemperatureArray, with no header row.
Edit:
The file currently has the TemperatureArray followed by the Dates: [TemperatureArray, Date].
-0.27724356949570034,-0.3096554106726788,a
-0.10619546908708237,0.07430127684522048,b
-0.07619665345406437,0.8474460146082116,c
0.19668718143436803,-0.8072994364484335,d
I am looking to get: [Date, TemperatureArray]
a,-0.27724356949570034,-0.3096554106726788
b,-0.10619546908708237,0.07430127684522048
c,-0.07619665345406437,0.8474460146082116
d,0.19668718143436803,-0.8072994364484335
The pandas.DataFrame.to_csv method has a keyword argument, header=True, that can be turned off to disable the header.
However, in my experience it sometimes does not work on its own.
Using it in conjunction with index=False should solve your issue.
For example, this snippet should fix your issue:
TempDay.to_csv('C:\MyFile.csv', index=False, header=False)
Here is a full example showing how it disables the header row:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randn(6,4))
>>> df
0 1 2 3
0 1.295908 1.127376 -0.211655 0.406262
1 0.152243 0.175974 -0.777358 -1.369432
2 1.727280 -0.556463 -0.220311 0.474878
3 -1.163965 1.131644 -1.084495 0.334077
4 0.769649 0.589308 0.900430 -1.378006
5 -2.663476 1.010663 -0.839597 -1.195599
>>> # just assigns sequential letters to the column
>>> df[4] = [chr(i+ord('A')) for i in range(6)]
>>> df
0 1 2 3 4
0 1.295908 1.127376 -0.211655 0.406262 A
1 0.152243 0.175974 -0.777358 -1.369432 B
2 1.727280 -0.556463 -0.220311 0.474878 C
3 -1.163965 1.131644 -1.084495 0.334077 D
4 0.769649 0.589308 0.900430 -1.378006 E
5 -2.663476 1.010663 -0.839597 -1.195599 F
>>> # here we reindex the headers and return a copy
>>> # using this form of indexing just requires you to provide
>>> # a list with all the columns you desire and in the order desired
>>> df2 = df[[4, 1, 2, 3]]
>>> df2
4 1 2 3
0 A 1.127376 -0.211655 0.406262
1 B 0.175974 -0.777358 -1.369432
2 C -0.556463 -0.220311 0.474878
3 D 1.131644 -1.084495 0.334077
4 E 0.589308 0.900430 -1.378006
5 F 1.010663 -0.839597 -1.195599
>>> df2.to_csv('a.txt', index=False, header=False)
>>> with open('a.txt') as f:
... print(f.read())
...
A,1.1273756275298716,-0.21165535441591588,0.4062624848191157
B,0.17597366083826546,-0.7773584823122313,-1.3694320591723093
C,-0.556463084618883,-0.22031139982996412,0.4748783498361957
D,1.131643603259825,-1.084494967896866,0.334077296863368
E,0.5893080536600523,0.9004299653290818,-1.3780062860066293
F,1.0106633581546611,-0.839597332636998,-1.1955992812601897
If you need to dynamically adjust the columns, and move the last column to the first, you can do as follows:
# this returns the columns as a list
columns = df.columns.tolist()
# remove the last column, the newest one you added
tofirst_column = columns.pop(-1)
# just move it to the start
new_columns = [tofirst_column] + columns
# then reindex with the new column order
df2 = df[new_columns]
This simply takes the current column list, moves the last column to the front, and reindexes, without needing any prior knowledge of the headers.
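If you don't need the intermediate names, the same reordering fits in one line (a sketch against the df above):

# slice the column index directly: last column first, then the rest
df2 = df[[df.columns[-1]] + list(df.columns[:-1])]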