How to speed up the code - searching through a dataframe takes hours - python

I've got a CSV file containing the distance between centroids in a GIS model, in the following format:
InputID,TargetID,Distance
1,2,3050.01327866
1,7,3334.99565217
1,5,3390.99115304
1,3,3613.77046864
1,4,4182.29900892
...
...
3330,3322,955927.582933
It is sorted on origin (InputID) and then on the nearest destination (TargetID).
For a specific modelling tool I need this data in a CSV file, formatted as follows (the numbers are the centroid numbers):
distance1->1, distance1->2, distance1->3,.....distance1->3330
distance2->1, distance2->2,.....
.....
distance3330->1,distance3330->2....distance3330->3330
So no InputID's or TargetID's, just the distances with the origins on the rows and the destinations on the columns:
(example for the first 5 origins/destinations)
0,3050.01327866,3613.77046864,4182.29900892,3390.99115304
3050.01327866,0,1326.94611797,1175.10254872,1814.45584129
3613.77046864,1326.94611797,0,1832.209595,3132.78725738
4182.29900892,1175.10254872,1832.209595,0,1935.55056767
3390.99115304,1814.45584129,3132.78725738,1935.55056767,0
I've written the following code, and it works. But it is so slow that producing the 3330x3330 file would take days. As I am a beginner in Python, I think I am overlooking something...
import pandas as pd
import numpy as np
file = pd.read_csv('c:\\users\\Niels\\Dropbox\\Python\\centroid_distances.csv')
df = file.sort_values(['InputID', 'TargetID'], ascending=[True, True])
number_of_zones = 3330
text_file = open("c:\\users\\Niels\\Dropbox\\Python\\Output.csv", "w")
for origin in range(1, number_of_zones):
    output_string = ''
    print(origin)
    for destination in range(1, number_of_zones):
        if origin == destination:
            distance = 0
        else:
            distance_row = df[(df['InputID'] == origin) & (df['TargetID'] == destination)]
            # I guess this is the time-consuming part
            distance = distance_row.iloc[0]['Distance']
        output_string = output_string + str(distance) + ','
    text_file.write(output_string[:-1] + '\n')  # strip last ',' off the line
text_file.close()
Could you give me some hints to speed up this code?

IIUC, all you need is pivot. If you start from a frame like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "InputID": np.arange(36) // 6 + 1,
    "TargetID": np.arange(36) % 6 + 1,
    "Distance": np.random.uniform(0, 100, 36),
})
df = df[df.InputID != df.TargetID]
df = df.sort_values(["InputID", "Distance"])
>>> df.head()
InputID TargetID Distance
2 1 3 6.407198
3 1 4 43.037829
1 1 2 52.121284
4 1 5 86.769620
5 1 6 96.703294
and we know the (InputID, TargetID) pairs are unique, we can simply pivot:
>>> pv = df.pivot(index="InputID", columns="TargetID", values="Distance").fillna(0)
>>> pv
TargetID 1 2 3 4 5 6
InputID
1 0.000000 52.121284 6.407198 43.037829 86.769620 96.703294
2 53.741611 0.000000 27.555296 85.328607 59.561345 8.895407
3 96.142920 62.532984 0.000000 6.320273 37.809105 69.896308
4 57.835249 49.350647 38.660269 0.000000 7.151053 45.017780
5 72.758342 48.947788 4.212775 98.183169 0.000000 15.702280
6 32.468329 83.979431 23.578347 30.212883 82.580496 0.000000
>>> pv.to_csv("out_dist.csv", index=False, header=False)
>>> !cat out_dist.csv
0.0,52.1212839519,6.40719759732,43.0378290605,86.769620064,96.7032941473
53.7416111725,0.0,27.5552964592,85.3286070586,59.5613449796,8.89540736892
96.1429198049,62.5329836475,0.0,6.32027280686,37.8091052942,69.8963084944
57.8352492462,49.3506467609,38.6602692461,0.0,7.15105257546,45.0177800391
72.7583417281,48.9477878574,4.21277494476,98.183168992,0.0,15.7022798801
32.4683285321,83.9794307564,23.578346756,30.2128827937,82.5804959193,0.0
The reshaping section of the pandas documentation might be useful.
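Applied to the file from the question, a minimal sketch (assuming the InputID,TargetID,Distance columns described there and zones numbered 1..3330; the reindex guards against zones that never appear as an origin or destination):
import pandas as pd

df = pd.read_csv('centroid_distances.csv')
pv = df.pivot(index='InputID', columns='TargetID', values='Distance')

# ensure every zone appears as both a row and a column, then fill
# the diagonal (and any absent pairs) with 0
zones = range(1, 3331)
pv = pv.reindex(index=zones, columns=zones).fillna(0)
pv.to_csv('Output.csv', index=False, header=False)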

Related

Find "most used items" per "level" in big csv file with Pandas

I have a rather big csv file and I want to find out which items are used the most at a certain player level.
So one column I'm looking at has all the player levels (from 1 to 30) another column has all the item names (e.g. knife_1, knife_2, etc.) and yet another column lists backpacks (backback_1, backpack_2, etc.).
Now I want to check which is the most used knife and backpack for player level 1, for player level 2, player level 3, etc.
What I tried was this, but when I verified it in Excel (with COUNTIFS) the results were different:
import pandas as pd
df = pd.read_csv('filename.csv')
#getting the columns I need:
df = df[["playerLevel", "playerKnife", "playerBackpack"]]
print(df.loc[df["playerLevel"] == 1].mode())
In my head, this should locate all the rows with playerLevel 1 and then print only the most used items for that level. However, I wanted to double-check and used COUNTIFS in Excel, which gave me a different result.
Maybe I'm thinking too simple (or complicated) so I hope you can either verify that my code should be correct or point out the error.
I'm also looking for an easy way to then go through all levels automatically and print out the most used items for each level.
Thanks in advance.
Edit:
Dataframe example. Just imagine there are thousands of players that can range from level 1 to level 30. And especially on higher levels, they have access to a lot of knives and backpacks. So the combinations are limitless.
index playerLevel playerKnife playerBackpack
0 1 knife_1 backpack_1
1 2 knife_2 backpack_1
2 3 knife_1 backpack_2
3 1 knife_2 backpack_1
4 2 knife_3 backpack_2
5 1 knife_1 backpack_1
6 15 knife_13 backpack_12
7 13 knife_10 backpack_9
8 1 knife_1 backpack_2
Try the following:
data = """\
index playerLevel playerKnife playerBackpack
0 1 knife_1 backpack_1
1 2 knife_2 backpack_1
2 3 knife_1 backpack_2
3 1 knife_2 backpack_1
4 2 knife_3 backpack_2
5 1 knife_1 backpack_1
6 15 knife_13 backpack_12
7 13 knife_10 backpack_9
8 1 knife_1 backpack_2
"""
import io
import pandas as pd
stream = io.StringIO(data)
df = pd.read_csv(stream, sep=r'\s+')
df = df.drop('index', axis='columns')
print(df.groupby('playerLevel').agg(pd.Series.mode))
yields
playerKnife playerBackpack
playerLevel
1 knife_1 backpack_1
2 [knife_2, knife_3] [backpack_1, backpack_2]
3 knife_1 backpack_2
13 knife_10 backpack_9
15 knife_13 backpack_12
Note that the result of df.groupby('playerLevel').agg(pd.Series.mode) is a DataFrame, so you can assign that result and use it as a normal dataframe.
For data read directly from a CSV file, simply use
df = pd.read_csv('filename.csv')
df = df[['playerLevel', 'playerKnife', 'playerBackpack']]  # or whichever columns you want
stats = df.groupby('playerLevel').agg(pd.Series.mode)  # stats will be a dataframe as well
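A short usage sketch for the stats frame built above: a unique mode is stored as a plain string, while a tie comes back as an array of the tied items, so downstream code may need to handle both cases.
print(stats.loc[1, 'playerKnife'])  # 'knife_1'
print(stats.loc[2, 'playerKnife'])  # ['knife_2' 'knife_3'] on a tie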

how to write a list in a file with a specific format?

I have a Python list and want to write it out in a specific format.
input:
trend_end= ['skill1',10,0,13,'skill2',6,1,0,'skill3',5,8,9,'skill4',9,0,1]
I want to write a file like this:
output:
1 2 3
1 10 0 13
2 6 1 0
3 5 8 9
4 9 0 1
Basically, I need to do the following steps:
Separate the elements of the list for each skill.
Write them in a table shape, adding column and row indices.
I want to use it as input to another piece of software; that's why I need to write a file.
I did this, but I know it is wrong; can you see how I can fix it?
f1 = open("data.txt", "a")
for j in trend_end:
    f1.write(str(j))
for i in range(1, int(len(trend_end)/df1ana.shape[0])):
    G = [trend_end[i*(df1ana.shape[0]-10) - (df1ana.shape[0]-10):i*(df1ana.shape[0]-10)]]
    for h in G:
        f1.write(i)
        f1.write(h)
        f1.write('\n')
f.close()
df1ana.shape[0] is 3 in the above example; it is basically the length of the data for each skill.
Another option that you can try via pandas:
import pandas as pd
pd.DataFrame([trend_end[i+1:i+4] for i in range(0,len(trend_end),4)]).to_csv('data.txt', sep='\t')
OUTPUT:
0 1 2
0 10 0 13
1 6 1 0
2 5 8 9
3 9 0 1
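Note the header and index above are 0-based; to match the 1-based labels in the desired output, a small tweak (a sketch of the same one-liner, split up):
out = pd.DataFrame([trend_end[i+1:i+4] for i in range(0, len(trend_end), 4)])
out.index += 1    # 1-based row labels
out.columns += 1  # 1-based column labels
out.to_csv('data.txt', sep='\t')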
You should iterate over the list in steps of 4, i.e. df1ana.shape[0] + 1:
steps = df1ana.shape[0] + 1  # 4 in this example
with open("data.txt", "a") as f:
    f.write('  ' + ' '.join(str(n) for n in range(1, steps)) + '\n')  # write header line
    for row, i in enumerate(range(1, len(trend_end), steps), start=1):
        f.write(f"{row:<3}")  # row index
        for j in range(i, i + steps - 1):
            f.write(f"{trend_end[j]:<3}")
        f.write("\n")
The :<3 formatting puts each value in a 3-character, left-aligned field.
This should work regardless of the number of groups or the number of records per group. It uses the difference between the size of the full list and the integers-only list to calculate the number of rows, and the ratio of the number of integers to the number of rows to get the number of columns.
import numpy as np
import pandas as pd

digits = [x for x in trend_end if isinstance(x, int)]
n_rows = len(trend_end) - len(digits)  # one row per skill label
n_cols = len(digits) // n_rows         # integers per skill
pd.DataFrame(np.reshape(digits, (n_rows, n_cols))).to_csv('output.csv')
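For the sample trend_end from the question, this writes the four skill rows to output.csv; note that to_csv includes the default 0-based header and index unless you pass header=False and index=False:
,0,1,2
0,10,0,13
1,6,1,0
2,5,8,9
3,9,0,1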

python stacking data with missing values in the header

I have data that is imported from a CSV file; in reality there are more columns and more cycles, but this is a representative snippet:
Export date 2020-10-10
Record #3 Record #2 Record #1
Cycle #5 Cycle #4 Cycle #3
time ( min.) Parameter1 Something2 Whatever3 Parameter1 Something2 Whatever3 Parameter1 Something2 Whatever3
0 0.0390625 9.89619 0.853909 14.409 10.1961 0.859037 14.4676 10.0274 0.832598
1 0.0390625 9.53452 0.949844 14.4096 10.3034 1.224 14.4676 10.0323 1.20403
2 0.0390625 9.8956 1.47227 14.4097 10.6586 1.14486 14.4676 10.4936 1.12747
3 0.0390625 10.7829 1.44412 14.4097 10.9185 1.20247 14.5116 10.6892 1.12459
The top part of the data contains a row (export date) that is not needed in the table.
I would like to stack the data so that there will be Cycle and Record columns. The problem is that these values are found only above the first column of data for every cycle. For example, Cycle5 has three columns of data, then Cycle4 has three columns of data etc.
This is what the output should look like: the same measurements stacked, with added Cycle and Record columns (as in the result table at the end of the answer below).
I didn't get very far:
df = pd.read_csv('cycles.csv')
#Fill the names of cycles to the right
df.ffill(axis = 1, inplace = True)
#Not sure this is needed, it might make it easier to melt/stack
df.iloc[0,0] = "time ( min.)"
df.iloc[1,0] = "time ( min.)"
Thank you for your thoughts and assistance!!
There are a couple of problems with this, all of which you need to address:
First, read all the required info:
This cannot be done unless the header info is read separately:
import pandas as pd
from io import StringIO

string = open('SO.csv').read()
# header line 1 holds the record numbers, line 2 the cycle numbers
records = [i.split('#')[1].strip() for i in string.split('\n')[1].split(',') if '#' in i]
cycles = [i.split('#')[1].strip() for i in string.split('\n')[2].split(',') if '#' in i]
data = pd.read_csv(StringIO(string), sep=',', header=3).dropna(how='any')
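As a quick sanity check: for the sample snippet in the question, these parsed lists would come out as shown below (the real file may contain more groups):
print(records)  # ['3', '2', '1']
print(cycles)   # ['5', '4', '3']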
Rename columns so they follow a pattern:
# read_csv de-duplicates repeated headers as 'Parameter1', 'Parameter1.1', ...;
# give the first occurrence an explicit '.0' suffix so every group follows the pattern
cols = [col for col in data.columns if '.' not in col]
data = data.rename(columns=dict(zip(cols, [col + '.0' for col in cols])))
Build a loop to pluck out the columns for each record and cycle:
dfs = []
for rdx, rec in enumerate(records):
    # select the time column plus the data columns belonging to this group
    group_cols = [col for col in data.columns if col.endswith(f'.{rdx}')]
    df = data[['time ( min.)'] + group_cols].rename(
        columns=dict(zip([col + f'.{rdx}' for col in cols], cols)))
    df[['Cycle', 'Record']] = cycles[rdx], records[rdx]
    dfs.append(df)
Finally, merge them all:
pd.concat(dfs)
This results in:
time ( min.) Parameter1 Something2 Whatever3 Cycle Record
0 0.0 0.039062 9.89619 0.853909 5 3
1 1.0 0.039062 9.53452 0.949844 5 3
2 2.0 0.039062 9.89560 1.472270 5 3
3 3.0 0.039062 10.78290 1.444120 5 3
0 0.0 14.409000 10.19610 0.859037 4 2
1 1.0 14.409600 10.30340 1.224000 4 2
2 2.0 14.409700 10.65860 1.144860 4 2
3 3.0 14.409700 10.91850 1.202470 4 2
0 0.0 14.467600 10.02740 0.832598 3 1
1 1.0 14.467600 10.03230 1.204030 3 1
2 2.0 14.467600 10.49360 1.127470 3 1
3 3.0 14.511600 10.68920 1.124590 3 1
Breaking a problem down into simple steps will help you not only here but in every other case. Just figure out what you need to do, break it into steps, and go for it!

How to avoid using a loop in a df when accessing previous rows

I use pandas to process transport data; I'm studying the ridership of bus lines. I have two columns counting the people getting on and off the bus at each stop, and I want to create one that counts the people currently on board. At the moment, I loop through the df, and for row n it computes current[n] = on[n] - off[n] + current[n-1], as shown in the following example:
for index, row in df.iterrows():
    if index == 0:
        df.loc[index, 'current'] = df.loc[index, 'on']
    else:
        df.loc[index, 'current'] = df.loc[index, 'on'] - df.loc[index, 'off'] + df.loc[index-1, 'current']
Is there a way to avoid using a loop ?
Thanks for your time !
You can use Series.cumsum(), which accumulates the numbers in a given Series.
import pandas as pd

a = pd.DataFrame([[3, 4], [6, 4], [1, 2], [4, 5]], columns=["off", "on"])
a["current"] = a["on"].cumsum() - a["off"].cumsum()
off on current
0 3 4 1
1 6 4 -1
2 1 2 0
3 4 5 1
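Note one subtle difference from the loop in the question: that loop ignores off on the very first row (index 0 gets current = on only). If that behaviour is intended, a minimal adjustment is to zero out the first off value before accumulating:
# reproduce the original loop exactly: ignore 'off' at the first stop
off_adj = a["off"].copy()
off_adj.iloc[0] = 0
a["current"] = a["on"].cumsum() - off_adj.cumsum()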
If I've understood the problem properly, you could calculate the difference between people getting on and off, then have a running total using Series.cumsum():
import pandas as pd
# Create dataframe for demo
d = {'Stop':['A','B','C','D'],'On':[3,2,3,2],'Off':[2,1,0,1]}
df = pd.DataFrame(data=d)
# Get difference between 'On' and 'Off' columns.
df['current'] = df['On']-df['Off']
# Get cumulative sum of column
df['Total'] = df['current'].cumsum()
# Same thing in one line
df['Total'] = (df['On']-df['Off']).cumsum()
Stop On Off Total
A 3 2 1
B 2 1 2
C 3 0 5
D 2 1 6

How do I convert a row from a pandas DataFrame from a Series back to a DataFrame?

I am iterating through the rows of a pandas DataFrame, expanding each one out into N rows with additional info on each one (for simplicity I've made it a random number here):
from pandas import DataFrame
import pandas as pd
from numpy import random, arange

N = 3
x = DataFrame.from_dict({'farm': ['A', 'B', 'A', 'B'],
                         'fruit': ['apple', 'apple', 'pear', 'pear']})
out = DataFrame()
for i, row in x.iterrows():
    rows = pd.concat([row]*N).reset_index(drop=True)  # requires row to be a DataFrame
    out = out.append(rows.join(DataFrame({'iter': arange(N), 'value': random.uniform(size=N)})))
In this loop, row is a Series object, so the call to pd.concat doesn't work as intended. How do I convert it to a DataFrame? (e.g. the difference between x.ix[0:0] and x.ix[0])
Thanks!
Given what you commented, I would try
def giveMeSomeRows(group):
    return random.uniform(low=group.low, high=group.high, size=N)

results = x.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
This should give you a separate result dataframe. I have assumed that every farm-fruit combination is unique... there might be other ways, if we knew more about your data.
Update
Running code example
def giveMeSomeRows(group):
    return random.uniform(low=group.low, high=group.high, size=N)

N = 3
df = pd.DataFrame(arange(0, 8).reshape(4, 2), columns=['low', 'high'])
df['farm'] = 'a'
df['fruit'] = arange(0, 4)
results = df.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
df
low high farm fruit
0 0 1 a 0
1 2 3 a 1
2 4 5 a 2
3 6 7 a 3
results
farm fruit
a 0 [0.176124290969, 0.459726835079, 0.999564934689]
1 [2.42920143009, 2.37484506501, 2.41474002256]
2 [4.78918572452, 4.25916442343, 4.77440617104]
3 [6.53831891152, 6.23242754976, 6.75141668088]
If instead you want a dataframe, you can update the function to
def giveMeSomeRows(group):
    return pd.DataFrame(random.uniform(low=group.low, high=group.high, size=N))
results
0
farm fruit
a 0 0 0.281088
1 0.020348
2 0.986269
1 0 2.642676
1 2.194996
2 2.650600
2 0 4.545718
1 4.486054
2 4.027336
3 0 6.550892
1 6.363941
2 6.702316
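As for the literal question of converting the Series that iterrows() yields back into a DataFrame, a minimal sketch using Series.to_frame() (a standard pandas idiom, not part of the answer above):
# each row from iterrows() is a Series; to_frame().T turns it back
# into a one-row DataFrame that pd.concat can then replicate
for i, row in x.iterrows():
    row_df = row.to_frame().T
    rows = pd.concat([row_df] * N, ignore_index=True)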
