Looping through a pandas dataframe - how to make code run faster?

I have a dataframe, df, with 43244 rows, and a txt file, text with 1107957 lines. The purpose of the following code is to evaluate entries in df, and return a word_id value if they are present in the text.
with open('text.txt') as f:
    text = f.readlines()
for index, row in df.iterrows():
    lemma_id = 0
    for lines in range(len(text)):
        word_row = text[lines].split()
        if word_row[2] == row['Word']:
            word_id = word_row[1]
            row['ID'] = word_id
However, this code would take an estimated 120 days to complete in my jupyter notebook, and I (obviously) want it to execute a bit more efficiently.
How do I approach this? Should I convert text into a dataframe/database, or is there another more efficient approach?
EDIT
Example of dataframe structure:
   Word   ID
0  hello  NaN
1  there  NaN
Example of txt.file structure:
NR     ID     WORD
32224  86289  ah
32225  86290  general
32226  86291  kenobi

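Have you tried using pandas.merge?
Your for loop would be replaced by the following (assuming that text has been loaded into a DataFrame, text_df, with the columns NR, ID and WORD). The empty ID column of df is dropped first so that it does not clash with the ID column of text_df:
new_df = pd.merge(df.drop(columns='ID'), text_df, left_on='Word', right_on='WORD', how='left')
new_df.dropna(subset=['ID'], inplace=True)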
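If loading the text file into a DataFrame is not convenient, a dictionary lookup is another option. The sketch below assumes each line has the three whitespace-separated fields shown in the question (NR, ID, WORD) and keeps the first ID seen for each word, which is an assumption. The names sample_lines and build_word_index are hypothetical, and sample_lines stands in for the contents of text.txt.

```python
# Hypothetical sample standing in for the lines of text.txt (NR, ID, WORD).
sample_lines = [
    "32224 86289 ah",
    "32225 86290 general",
    "32226 86291 kenobi",
    "32227 86292 general",
]

def build_word_index(lines):
    """Map each word to the ID of its first occurrence in a single pass."""
    word_to_id = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        word_id, word = parts[1], parts[2]
        word_to_id.setdefault(word, word_id)  # keep the first ID seen
    return word_to_id

word_to_id = build_word_index(sample_lines)

# Each lookup is now a dictionary access instead of a scan over every line.
words = ["hello", "general", "kenobi"]
ids = [word_to_id.get(w) for w in words]
print(ids)
```

With the dataframe from the question, the same dictionary could be applied in one step with df['ID'] = df['Word'].map(word_to_id); words that are not in the file are left as NaN.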

Related

How to calculate the number of occurrences between data in excel?

I have a huge CSV table with thousands of rows. I want to build a table that gives, for each pair of elements, the number of times the two appear together divided by the number of times the first element appears.
For example, Bitcoin appears 8 times in these rows, 2 of them together with API. API always appears with Bitcoin, so the value of API appearing with Bitcoin is 1, and the value of Bitcoin appearing with API is 1/4.
I want something that looks like this in the end (image not preserved).
How can I do it with Python or any other tool?
This is a sample of the file (image not preserved).

This, I think, does do the job. I typed your spreadsheet into a csv by hand (would have been nice to be able to cut and paste), and the results seem reasonable.
import itertools
import csv
import numpy as np
words = {}
for row in open('input.csv'):
    parts = row.rstrip().split(',')
    for a,b in itertools.combinations(parts,2):
        if a not in words:
            words[a] = [b]
        else:
            words[a].append( b )
        if b not in words:
            words[b] = [a]
        else:
            words[b].append( a )
print(words)
size = len(words)
keys = list(words.keys())
track = np.zeros((size,size))
for i,k in enumerate(keys):
    track[i,i] = len(words[k])
    for j in words[k]:
        track[i,keys.index(j)] += 1
        track[keys.index(j),i] += 1
print(keys)
# Scale to [0,1].
for row in range(track.shape[0]):
    track[row,:] /= track[row,row]
# Create a csv with the results.
fout = open('corresp.csv','w')
print( ','.join([' ']+keys), file=fout )
for row in range(track.shape[0]):
    print( keys[row], file=fout, end=',')
    print( ','.join(f"{track[row,i]}" for i in range(track.shape[1])), file=fout )
Here's the first few lines of the result:
,API,Backend Development,Bitcoin,Docker,Article Rewriting,Article writing,Blockchain,Content Writing,Ghostwriting,Android,Ethereum,PHP,React.js,C Programming,C++ Programming,ASIC,Digital ASIC Coding,Embedded Software,Article Writing,Blog,Copy Typing,Affiliate Marketing,Brand Marketing,Bulk Marketing,Sales,BlockChain,Business Strategy,Non-fungible Tokens,Technical Writing,.NET,Arduino,Software Architecture,Bluetooth Low Energy (BLE),C# Programming,Ada programming,Programming,Haskell,Rust,Algorithm,Java,Mathematics,Machine Learning (ML),Matlab and Mathematica,Data Entry,HTML,Circuit Designs,Embedded Systems,Electronics,Microcontroller, C++ Programming,Python
API,1.0,0.14285714285714285,0.5714285714285714,0.14285714285714285,0.0,0.0,0.2857142857142857,0.0,0.0,0.0,0.14285714285714285,0.0,0.14285714285714285,0.2857142857142857,0.2857142857142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Backend Development,0.6666666666666666,1.0,0.6666666666666666,0.6666666666666666,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Bitcoin,0.21052631578947367,0.05263157894736842,1.0,0.05263157894736842,0.0,0.0,0.2631578947368421,0.0,0.0,0.05263157894736842,0.10526315789473684,0.10526315789473684,0.05263157894736842,0.15789473684210525,0.21052631578947367,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.0,0.0,0.0,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.0,0.0,0.05263157894736842,0.0,0.0,0.0,0.0,0.05263157894736842,0.05263157894736842,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Docker,0.6666666666666666,0.6666666666666666,0.6666666666666666,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
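The ratio described in the question (rows containing both items divided by rows containing the first item) can also be computed directly with collections.Counter. This is only a sketch: the rows list is hypothetical sample data standing in for input.csv, and the names item_rows, pair_rows and ratio are made up for illustration.

```python
import itertools
from collections import Counter

# Hypothetical rows standing in for the lines of input.csv.
rows = [
    ["Bitcoin", "API"],
    ["Bitcoin", "Blockchain"],
    ["Bitcoin", "API", "Docker"],
    ["Bitcoin"],
]

item_rows = Counter()  # number of rows in which each item appears
pair_rows = Counter()  # number of rows in which an ordered pair appears

for row in rows:
    items = sorted(set(row))  # count each item at most once per row
    item_rows.update(items)
    pair_rows.update(itertools.permutations(items, 2))

def ratio(a, b):
    """Share of the rows containing `a` that also contain `b`."""
    return pair_rows[(a, b)] / item_rows[a]

print(ratio("API", "Bitcoin"))
print(ratio("Bitcoin", "API"))
```

In this sample API appears in 2 rows, both with Bitcoin, while Bitcoin appears in 4 rows, so the two directions of the same pair give different values.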

I had a look at this by creating a pivot table in Excel for every combination of columns: AB, AC, AD, BC, BD and CD. For each one I put the unique entries from the first column (e.g. A) in the rows, the unique entries from the second (e.g. B) in the columns, and column A in the values area. That gives me all matches and the count of each match.
This is a clunky method, but judging from the Python-based method that has been submitted, my answer is essentially no more or less clunky than that!

Panda module export, split data

I'm trying to read a .txt file and output the count of each letter which works, however, I'm having trouble exporting that data to .csv in a specific way.
A snippet of the code:
freqs = {}
with open(Book1) as f:
    for line in f:
        for char in line:
            if char in freqs:
                freqs[char] += 1
            else:
                freqs[char] = 1
print(freqs)
And for the exporting to csv, I did the following:
test = {'Book 1 Output':[freqs]}
df = pd.DataFrame(test, columns=['Book 1 Output'])
df.to_csv(r'book_export.csv', sep=',')
Currently when I run it, the export looks like this (Manually done):
However I want the output to be each individual row, so it should look something like this when I open it:
I want it to separate it from the ":" and "," into 3 different columns.
I've tried various other answers on here but most of them end up with giving ValueErrors so maybe I just don't know how to apply it like the following one.
df[[',']] = df[','].str.split(expand=True)

Use DataFrame.from_dict with DataFrame.rename_axis to set the index name; the CSV then looks the way you need:
#sample data
freqs = {'a':5,'b':2}
df = (pd.DataFrame.from_dict(freqs, orient='index', columns=['Book 1 Output'])
        .rename_axis('Letter'))
print (df)
        Book 1 Output
Letter
a                   5
b                   2
df.to_csv(r'book_export.csv', sep=',')
An alternative is to use a Series:
s = pd.Series(freqs, name='Book 1 Output').rename_axis('Letter')
print (s)
Letter
a    5
b    2
Name: Book 1 Output, dtype: int64
s.to_csv(r'book_export.csv', sep=',')
EDIT:
If there are multiple frequency dictionaries, change the DataFrame constructor:
freqs = {'a':5,'b':2}
freqs1 = {'a':9,'b':3}
df = pd.DataFrame({'f1':freqs, 'f2':freqs1}).rename_axis('Letter')
print (df)
        f1  f2
Letter
a        5   9
b        2   3
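For comparison, the counting and the export can also be done without pandas. This sketch swaps in collections.Counter and csv.writer from the standard library. The sample text and the column names Letter and Count are assumptions, and an in-memory buffer stands in for book_export.csv.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for the contents of the book file.
text = "abba cab"

# Counter tallies every character, including spaces and newlines.
freqs = Counter(text)

# Write one row per character; io.StringIO stands in for
# open('book_export.csv', 'w', newline='').
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Letter", "Count"])
for char, count in sorted(freqs.items()):
    writer.writerow([char, count])

print(buffer.getvalue())
```

Each character ends up on its own row with its count in the second column, which is the same layout the from_dict approach produces.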

Extract prefix from string in dataframe column where exists in a list

Looking for some help.
I have a pandas dataframe column and I want to extract the prefix where such prefix exists in a separate list.
pr_list = ['1 FO-','2 IA-']
Column in df is like
PartNumber
ABC
DEF
1 FO-BLABLA
2 IA-EXAMPLE
What I am looking for is to extract the prefix where present, put in a new column and leave the rest of the string in the original column.
PartNumber  Prefix
ABC
DEF
BLABLA      1 FO-
EXAMPLE     2 IA-
I have tried things like str.startswith, but I am a bit of a Python novice and wasn't able to get it to work.
Any help is much appreciated.
EDIT
Both solutions below work on the test data, however I am getting an error
error: nothing to repeat at position 16
Which suggests something askew in my dataset. Not sure what position 16 refers to but looking at both the prefix list and PartNumber column in position 16 nothing seems out of the ordinary?
EDIT 2
I have traced it to an * in pr_list, which seems to be throwing it off. Is * some reserved character? Is there a way to escape it so it is read as text?

You can try:
df['Prefix']=df.PartNumber.str.extract(r'({})'.format('|'.join(pr_list))).fillna('')
# regex=True is needed on recent pandas versions, where str.replace defaults to literal matching
df.PartNumber=df.PartNumber.str.replace('|'.join(pr_list),'',regex=True)
print(df)
  PartNumber Prefix
0        ABC
1        DEF
2     BLABLA  1 FO-
3    EXAMPLE  2 IA-

Maybe it's not exactly what you are looking for, but it may help.
import pandas as pd
pr_list = ['1 FO-','2 IA-']
df = pd.DataFrame({'PartNumber':['ABC','DEF','1 FO-BLABLA','2 IA-EXAMPLE']})
# Build one alternation pattern from the prefixes
extr = '|'.join(pr_list)
df['Prefix'] = df['PartNumber'].str.extract('('+ extr + ')', expand=False).fillna('')
# regex=True is needed on recent pandas versions, where str.replace defaults to literal matching
df['PartNumber'] = df['PartNumber'].str.replace(extr, '', regex=True)
df
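On the error in EDIT 2: * is a regex metacharacter (a repetition operator), so a prefix that contains it breaks a pattern built by joining the raw strings. A minimal sketch of one way around this, assuming the prefixes should be matched literally and only at the start of the string, is to pass each prefix through re.escape first. The prefix '3*X-' and the part number '3*X-WIDGET' are made-up examples.

```python
import re
import pandas as pd

# Hypothetical prefix list that includes a regex metacharacter (*).
pr_list = ['1 FO-', '2 IA-', '3*X-']
df = pd.DataFrame(
    {'PartNumber': ['ABC', '1 FO-BLABLA', '2 IA-EXAMPLE', '3*X-WIDGET']}
)

# re.escape turns each prefix into a literal pattern; ^ anchors the match
# to the start of the string so only leading prefixes are picked up.
pattern = '^(' + '|'.join(re.escape(p) for p in pr_list) + ')'

df['Prefix'] = df['PartNumber'].str.extract(pattern, expand=False).fillna('')
df['PartNumber'] = df['PartNumber'].str.replace(pattern, '', regex=True)
print(df)
```

Anchoring with ^ is a design choice rather than something the question requires; without it, a prefix string occurring in the middle of a part number would also be extracted and removed.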

python csv subcombinations columns

I have csv file like this:
F1,F2,F3,F4,F5,F6,F7,F8,F9,F10,label
a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,L1
b1,b2,b3,b4,b5,b6,b7,b8,b9,b10,L2
I want to write combinations of the columns to files as follows:
- For 1-column combinations with the label, write the expected results to:
file1:
F1,label
a1,L1
b1,L2
file2:
F2,label
a2,L1
b2,L2
until
file10:
F10,label
a10,L1
b10,L2
- For 2-column combinations with the label, write the expected results to:
2C_file1:
F1,F2,label
a1,a2,L1
b1,b2,L2
2C_file2:
F1,F3,label
a1,a3,L1
b1,b3,L2
until
2C_file45:
F9,F10,label
a9,a10,L1
b9,b10,L2
- For 3-column combinations with the label, write to 120 files:
.....until.....
- For 9-column combinations with the label, write to 10 files:
- For 10-column combinations with the label, write to 1 file:
I have searched and found some Python code for string combinations with itertools.
How could I achieve the above tasks with Python code?
import itertools as iters
text='ABCDEFGHIJ'
C1= iters.combinations(text,1)
print(list(C1))
C2= iters.combinations(text,2)
print(list(C2))
.....
C9= iters.combinations(text,9)
print(list(C9))
C10=iters.combinations(text,10)
print(list(C10))

This loop structure should produce the kind of output you are after, although it still has to be changed to write to files. The sequence length and the start position that generate each block are the same values you would use to name the output file:
#!/usr/bin/env python
row1 = ["F{i}".format(i=i) for i in range(1,11)]
row1.append("label")
row2 = ["a{i}".format(i=i) for i in range(1,11)]
row2.append("L1")
row3 = ["b{i}".format(i=i) for i in range(1,11)]
row3.append("L2")
for SequenceLength in range(1, len(row1)):
    for SequencePositionStart in range(len(row1)):
        # Skip empty slices (start at or beyond the end index)
        if row1[SequencePositionStart:SequenceLength] == []:
            continue
        print(','.join(row1[SequencePositionStart:SequenceLength]), row1[-1], sep=",")
        print(','.join(row2[SequencePositionStart:SequenceLength]), row2[-1], sep=",")
        print(','.join(row3[SequencePositionStart:SequenceLength]), row3[-1], sep=",")
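The slices above only cover contiguous runs of columns. To enumerate every subset of feature columns, itertools.combinations can be applied to the column indices instead. The following is a sketch under stated assumptions: a three-feature table stands in for the ten-feature file, the helper column_subsets is hypothetical, and the file naming ("1C_file1.csv", "2C_file1.csv", ...) is a guess at the scheme in the question. Results are collected in a dictionary, with the actual file writing left as a comment.

```python
import csv
import io
import itertools

# Hypothetical in-memory copy of a smaller version of the CSV from the question.
source = io.StringIO(
    "F1,F2,F3,label\n"
    "a1,a2,a3,L1\n"
    "b1,b2,b3,L2\n"
)
header, *rows = list(csv.reader(source))
n_features = len(header) - 1  # the last column is the label

def column_subsets(size):
    """Yield (file_name, table) for every choice of `size` feature columns."""
    for number, cols in enumerate(
            itertools.combinations(range(n_features), size), start=1):
        keep = list(cols) + [n_features]  # always append the label column
        table = [[row[i] for i in keep] for row in [header] + rows]
        yield f"{size}C_file{number}.csv", table

outputs = {}
for size in range(1, n_features + 1):
    for name, table in column_subsets(size):
        outputs[name] = table
        # To write real files instead of collecting the tables:
        # with open(name, "w", newline="") as f:
        #     csv.writer(f).writerows(table)

print(len(outputs))
print(outputs["2C_file2.csv"])
```

With the ten feature columns from the question the same loops would produce 10, 45, 120, ... files per size, 1023 in total, so writing them into a dedicated directory is probably worthwhile.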

Exporting max values of different csv files in to one

I got 3 datasets which contain the flow in m3/s per location. Dataset 1 is a 5 year ARI flood, Dataset 2 is a 20 year ARI flood and Dataset 3 is a 50 year ARI flood.
Per location I found the maximum discharge (5,20 & 50)
Code:
for key in Data_5_ARI_RunID_Flow_New.keys():
    m = key
    y5F_RunID = Data_5_ARI_RunID_Flow_New.loc[:,m]
    y20F_RunID = Data_20_ARI_RunID_Flow_New.loc[:,m]
    y50F_RunID = Data_50_ARI_RunID_Flow_New.loc[:,m]
    max_y5F = max(y5F_RunID)
    max_y20F = max(y20F_RunID)
    max_y50F = max(y50F_RunID)
    Max_DataID = m, max_y5F, max_y20F, max_y50F
    print (Max_DataID)
The output is like this:
('G60_18', 44.0514, 47.625, 56.1275)
('Area5_11', 1028.4065, 1191.5946, 1475.9685)
('Area5_12', 1017.8286, 1139.2628, 1424.4304)
('Area5_13', 994.5626, 1220.0084, 1501.1483)
('Area5_14', 995.9636, 1191.8066, 1517.4541)
Now I want to export this result to a csv file, but I don't know how. I used this line of code, but it didn't work:
Max_DataID.to_csv(r'C:\Users\Max_DataID.csv', sep=',', index = False)

Use the file name myexample.csv together with the path where you want to create the file.
Please check that Max_DataID is an iterable of rows (for example, a list to which each tuple is appended inside your loop). In your output each row is a tuple, so I use list() to convert it into a list, which writerow accepts.
import csv
# In Python 3 the file must be opened in text mode; newline='' avoids blank lines between rows
with open('myexample.csv', 'w', newline='') as file:
    filewriter = csv.writer(file, delimiter=',')
    for data in Max_DataID:
        filewriter.writerow(list(data))

You can do the following.
df.to_csv(file_name, sep='\t')
Also, if you want to split it into chunks, like 10,000 rows, or whatever, you can do this.
import pandas as pd
for i,chunk in enumerate(pd.read_csv('C:/your_path_here/main.csv', chunksize=10000)):
    chunk.to_csv('chunk{}.csv'.format(i))
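Max_DataID in the question is a plain tuple, which has no to_csv method. One option, sketched below, is to build a DataFrame of per-location maxima and export that. The frames flow_5, flow_20 and flow_50 are hypothetical stand-ins for the three datasets (one column per location), and the column names max_5y, max_20y, max_50y and the index name location are assumptions.

```python
import pandas as pd

# Hypothetical stand-ins for the three flow DataFrames (one column per location).
flow_5 = pd.DataFrame({'G60_18': [40.0, 44.0514], 'Area5_11': [1028.4065, 950.0]})
flow_20 = pd.DataFrame({'G60_18': [47.625, 41.0], 'Area5_11': [1100.0, 1191.5946]})
flow_50 = pd.DataFrame({'G60_18': [56.1275, 50.0], 'Area5_11': [1475.9685, 1300.0]})

# DataFrame.max() returns the maximum of every column, indexed by column
# name, so the three results line up by location.
max_flows = pd.DataFrame({
    'max_5y': flow_5.max(),
    'max_20y': flow_20.max(),
    'max_50y': flow_50.max(),
}).rename_axis('location')

# Without a path, to_csv returns the CSV text; pass a file name to write to disk.
csv_text = max_flows.to_csv()
print(csv_text)
```

This replaces the explicit loop over keys with one column-wise max per dataset, and it yields one CSV row per location rather than only the last tuple.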
