I have two data frames, each with about 250K rows. I am trying to do a fuzzy lookup between a column in each data frame. After the lookup I need the indexes of the matches that meet a threshold.
Here are the details.
My df1:
Name State Zip combined_1
0 Auto MN 10 Auto,MN,10
1 Rtla VI 253 Rtla,VI,253
2 Huka CO 56218 Huka,CO,56218
3 kann PR 214 Kann,PR,214
4 Himm NJ 65216 Himm,NJ,65216
5 Elko NY 65418 Elko,NY,65418
6 Tasm MA 13 Tasm,MA,13
7 Hspt OH 43218 Hspt,OH,43218
My other data frame, df2, that I am trying to look up against:
Name State Zip combined_2
0 Kilo NC 69521 Kilo,NC,69521
1 Kjhl FL 3369 Kjhl,FL,3369
2 Rtla VI 25301 Rtla,VI,25301
3 Illt GA 30024 Illt,GA,30024
4 Huka CO 56218 Huka,CO,56218
5 Haja OH 96766 Haja,OH,96766
6 Auto MN 010 Auto,MN,010
7 Clps TX 44155 Clps,TX,44155
If you look closely, the fuzzy lookup should give good matches for indexes 0 and 2 in df1, matching df2 indexes 6 and 4.
So, I did this,
from fuzzywuzzy import fuzz

# save df1 indexes
df_1index = []
# save df2 indexes
df2_indexes = []
# save fuzzy ratios
fazz_rat = []

for index, details in enumerate(df1['combined_1']):
    for ind, information in enumerate(df2['combined_2']):
        fuzmatch = fuzz.ratio(str(details), str(information))
        if fuzmatch >= 94:
            df_1index.append(index)
            df2_indexes.append(ind)
            fazz_rat.append(fuzmatch)
As expected, I get the right results for this example case:
df_1index
>> [0,2]
df2_indexes
>> [6,4]
Running this against 250K * 250K row pairs takes far too long, though.
How can I speed up this lookup process? Is there a pandas or Python way to improve performance for what I want?
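For reference, one direction often suggested for this kind of problem (not part of the original post) is to move the scoring out of pure Python, e.g. with the rapidfuzz library, which computes fuzz.ratio in C++ and can score a whole block of rows against all choices at once. A minimal sketch, assuming the data frames from the question and a block size chosen only to keep memory bounded:

import numpy as np
from rapidfuzz import fuzz, process

choices = df2['combined_2'].astype(str).tolist()
queries = df1['combined_1'].astype(str).tolist()

df_1index, df2_indexes, fazz_rat = [], [], []
block = 1000  # assumed block size; a full 250K x 250K score matrix will not fit in memory
for start in range(0, len(queries), block):
    # score one block of df1 against all of df2 in parallel;
    # scores below the cutoff are returned as 0
    scores = process.cdist(queries[start:start + block], choices,
                           scorer=fuzz.ratio, score_cutoff=94, workers=-1)
    rows, cols = np.nonzero(scores)
    df_1index.extend(rows + start)
    df2_indexes.extend(cols)
    fazz_rat.extend(scores[rows, cols])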
It feels so straightforward, but I haven't found the answer to my question yet: how does one group by proximity, or closeness, of floats in pandas?
Ok, I could do this the loopy way, but my data is big and I hope I can expand my pandas skills with your help and do this elegantly.
I have a column of times in nanoseconds in my DataFrame. I want to group these into little clusters based on the proximity of their values. Most clusters will have two rows, maybe up to five or six. I do not know the number of clusters; there will be a massive number of very small ones.
I thought I could, e.g., introduce a second index or just an additional column with 1 for all rows of the first cluster, 2 for the second and so forth, so that groupby becomes straightforward thereafter.
something like:
      t (ns)            cluster
71    1524957248.4375         1
72    1524957265.625          1
699   14624846476.5625        2
700   14624846653.125         2
701   14624846661.287         2
1161  25172864926.5625        3
1160  25172864935.9375        3
Thanks for your help!
Assuming you want to create the "cluster" column from the index based on the proximity of the successive values, you could use:
thresh = 1
df['cluster'] = df.index.to_series().diff().gt(thresh).cumsum().add(1)
Or, using the "t (ns)" column directly (the threshold then needs to be chosen on the scale of the time values):
thresh = 1
df['cluster'] = df['t (ns)'].diff().gt(thresh).cumsum().add(1)
output:
t (ns) cluster
71 1.524957e+09 1
72 1.524957e+09 1
699 1.462485e+10 2
700 1.462485e+10 2
701 1.462485e+10 2
1161 2.517286e+10 3
1160 2.517286e+10 3
You can 'round' the t (ns) column by floor-dividing it by a threshold value and looking at the differences:
df[['t (ns)']].assign(
    cluster=(df['t (ns)'] // 10E7)
    .diff().gt(0).cumsum().add(1)
)
Or you can experiment with the number of bins you try to organize your data into:
bins = 3
df[['t (ns)']].assign(
    bins=pd.cut(df['t (ns)'], bins=bins)
           .cat.rename_categories(range(1, bins + 1))
)
I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
As you can see from the table above, I have multiple measurements per pregnancy (between 1 and 26 observations each).
I want to summarise the ultrasound measurements somehow such that I can replace the multiple measurements with a fixed number of features per pregnancy. So I thought of creating 3 new features, one for each trimester of pregnancy, that would hold the maximum measurement recorded during that trimester:
abdomCirc1st: this feature would hold the maximum value of all abdominal circumference measurements measured between 0 to 13 Weeks
abdomCirc2nd: this feature would hold the maximum value of all abdominal circumference measurements measured between 14 to 26 Weeks
abdomCirc3rd: this feature would hold the maximum value of all abdominal circumference measurements measured between 27 to 40 Weeks
So my final dataset would look like this:
PregnancyID MotherID abdomCirc1st abdomCirc2nd abdomCirc3rd
0 0 NaN 200 NaN
1 1 NaN 315 350
2 2 180 NaN NaN
The reason for using the maximum here is that a larger abdominal circumference is associated with the adverse outcome I am trying to predict.
But I am quite confused about how to go about this. I have used the groupby function previously to derive certain statistical features from the multiple measurements, but this is a more complex task.
What I want to do is the following:
Group all abdominal circumference measurements that belong to the same pregnancy into 3 trimesters based on gestationalAgeInWeeks value
Compute the maximum value of all abdominal circumference measurements within each trimester, and assign this value to the relevant feature: abdomCirc1st, abdomCirc2nd or abdomCirc3rd.
I think I have to do something along the lines of:
df["abdomCirc1st"] = df.groupby(['MotherID', 'PregnancyID', 'gestationalAgeInWeeks'])["abdomCirc"].transform('max')
But this code does not check what trimester the measurement was taken in (gestationalAgeInWeeks). I would appreciate some help with this task.
You can try this. It's a bit of a complicated query, but it seems to work:
(df.groupby(['MotherID', 'PregnancyID'])
   .apply(lambda d: d.assign(tm=(d['gestationalAgeInWeeks'] + 13 - 1) // 13)
                     .groupby('tm')['abdomCirc']
                     .apply(max))
   .unstack()
)
produces
tm                        1      2      3
MotherID PregnancyID
0        0              NaN  200.0    NaN
1        1              NaN  294.0  350.0
2        2            180.0    NaN    NaN
Let's unpick this a bit. First we groupby on MotherID, PregnancyID. Then we apply a function to each grouped dataframe (d).
For each d, we create a 'trimester' column 'tm' via assign (I assume I got the math right here, but correct it if it is wrong!), then we groupby 'tm' and apply max. For each sub-dataframe d we then obtain a Series of tm: max(abdomCirc).
Then unstack() moves tm into the column names.
You may want to rename these columns later, but I did not bother.
Solution 2
Come to think of it, you can simplify the above a bit:
(df.assign(tm=(df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns='gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
Similar idea, same output.
There is a magic command called query. This should do your work for now:
abdomCirc1st = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks <= 13')['abdomCirc'].max()
abdomCirc2nd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 14 and gestationalAgeInWeeks <= 26')['abdomCirc'].max()
abdomCirc3rd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 27 and gestationalAgeInWeeks <= 40')['abdomCirc'].max()
If you want something more automatic (rather than manually changing the ID values, MotherID and PregnancyID, for each different group of rows), you have to combine it with groupby (as you did on your own).
Check this as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
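For completeness, here is a sketch of that combination (not from the original answers; only the column names from the question are assumed): bin the weeks into trimesters with pd.cut, then take the per-trimester maximum per pregnancy and pivot the trimesters into columns.

import pandas as pd

# bin each measurement into a trimester based on gestationalAgeInWeeks
trimester = pd.cut(df['gestationalAgeInWeeks'],
                   bins=[0, 13, 26, 40],
                   labels=['abdomCirc1st', 'abdomCirc2nd', 'abdomCirc3rd'])

# per pregnancy, keep the maximum abdomCirc within each trimester;
# observed=False keeps empty trimesters as NaN columns
summary = (df.assign(trimester=trimester)
             .groupby(['MotherID', 'PregnancyID', 'trimester'], observed=False)['abdomCirc']
             .max()
             .unstack()
             .reset_index())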
I have DF[number] = pd.read_html(url.text).
I want to concatenate or join the DF lists (there are hundreds of them, e.g. DFs[400]) into a single pandas dataframe.
The dataframes are in list format, so a list of lists, but the container is indexed by number like a pandas dataframe.
[ Vessel Built GT DWT Size (m) Unnamed: 5
0 x XIN HUA Bulk Carrier 2012 44543 82269 229 x 32
1 b FRANCESCO CORRADO Bulk Carrier 2008 40154 77061 225 x 32
2 5 NAN XIN 17 Bulk Carrier 2001 40570 75220 225 x 32
3 p DIAMOND INDAH Bulk Carrier 2002 43321 77830 229 x 37
4 NaN PRIME LILY Bulk Carrier 2012 44485 81507 229 x 32
5 s EVGENIA Bulk Carrier 2011 92183 176000 292 x 45
I have tried things like
df[number] = pd.read_html(url.text)
for number in range(494):
    df = pd.concat(df[number])
but that doesn't seem to work. I have also tried
df1=pd.concat(df[1])
df2=pd.concat(df[2])
df3=pd.concat(df[3])
dfx=pd.concat([df1,df2,df3],ignore_index=True)
This is not what I want, as there are hundreds of these list dataframes.
I want one pandas dataframe that joins all of the list dataframes into one.
Just to be clear: the df container of the lists is a dict, while df[1] is a list.
You can use a list comprehension:
pd.concat([dfs[i] for i in range(len(dfs))])
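One caveat, based on an assumption about the setup rather than on the original answer: pd.read_html returns a list of DataFrames for each page, so if dfs is a dict whose values are those lists, the lists need to be flattened before concatenating. A minimal sketch:

import pandas as pd

# flatten the dict of lists into one flat list of DataFrames, then concatenate
all_frames = [frame for tables in dfs.values() for frame in tables]
combined = pd.concat(all_frames, ignore_index=True)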
I currently have a massive set of datasets, one for each year in the 2000s. I take a combination of three years and run my cleaning code on that.
The problem is that due to the size I can't run my cleaning code on it, as my memory runs out.
I was thinking about splitting the data using something like:
df.ix[1, N/x]
where N is the total number of rows in my dataframe. I think I should then replace the dataframe to clear up the memory being used. This does mean I have to load the dataframe first for each chunk I create.
There are several problems:
How do I get N when N can be different for each year?
The operation requires that groups of data stay together.
Is there a way to make x vary with the size of N?
Is all of this highly inefficient, or is there an efficient built-in function for this?
Dataframe looks like:
ID Location year other variables
1 a 2006
1 a 2007
1 b 2006
1 a 2005
2 c 2005
2 c 2007
3 d 2005
What I need is for all rows with the same ID to stay together, and for the data to be cut into manageable chunks whose size depends on the yearly changing total amount of data.
In this case it would be:
ID Location year other variables
1 a 2006
1 a 2007
1 b 2006
1 a 2005
ID Location year other variables
2 c 2005
2 c 2007
3 d 2005
The data originates from one CSV per year, so all 2005 data comes from the 2005 csv, 2006 data from the 2006 csv, etc.
The CSVs are loaded into memory and concatenated to form one set of three years.
The individual CSV files have the same setup as indicated above: each observation starts with an ID, location and year, followed by a lot of other variables.
Running it on a group-by-group basis would be a bad idea, as there are thousands, if not millions, of these IDs. They can have dozens of locations and a maximum of three years, and all of this needs to stay together.
In my experience, loops over this many rows take ages.
I was thinking maybe something along the lines of:
create a variable that counts the number of groups
use the maximum of this count variable and divide it by 4 or 5.
cut the data up into chunks this way
I am not sure if this would be efficient, and even less sure how to execute it.
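As an illustration of that plan (a sketch only, assuming the ID column shown above): numpy can split the unique IDs into a fixed number of groups, and taking all rows belonging to each group keeps every ID's rows together.

import numpy as np

n_chunks = 5  # e.g. divide the IDs into 4 or 5 groups, as described above
id_groups = np.array_split(df['ID'].unique(), n_chunks)

# each chunk contains every row of the IDs assigned to it
chunks = [df[df['ID'].isin(ids)] for ids in id_groups]
# each chunk can now be cleaned and written out separately before loading the next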
One way to achieve this would be as follows:
import numpy as np
import pandas as pd

# generating a random DF
num_rows = 100
locs = list('abcdefghijklmno')
df = pd.DataFrame(
    {'id': np.random.randint(1, 100, num_rows),
     'location': np.random.choice(locs, num_rows),
     'year': np.random.randint(2005, 2007, num_rows)})
df.sort_values('id', inplace=True)

print('**** sorted DF (first 10 rows) ****')
print(df.head(10))

# chopping the DF into chunks of roughly chunk_size unique ids each
chunk_size = 5
chunks = [i for i in df.id.unique()[::chunk_size]]
chunk_margins = [(chunks[i-1], chunks[i]) for i in range(1, len(chunks))]
# note: ids from the last boundary onwards are not covered by these margins
df_chunks = [df.loc[(df.id >= x[0]) & (df.id < x[1])] for x in chunk_margins]
print('**** first chunk ****')
print(df_chunks[0])
Output:
**** sorted DF (first 10 rows) ****
id location year
31 2 c 2005
85 2 e 2006
89 2 l 2006
70 2 i 2005
60 4 n 2005
68 7 g 2005
22 7 e 2006
73 10 i 2005
23 10 j 2006
47 16 n 2005
**** first chunk ****
id location year
31 2 c 2005
85 2 e 2006
89 2 l 2006
70 2 i 2005
60 4 n 2005
68 7 g 2005
22 7 e 2006
73 10 i 2005
23 10 j 2006
47 16 n 2005
6 16 k 2006
82 16 g 2005
Use chunked pandas by importing Blaze.
Instructions from http://blaze.readthedocs.org/en/latest/ooc.html
Naive use of Blaze triggers out-of-core systems automatically when called on large files.
d = Data('my-small-file.csv')
d.my_column.count() # Uses Pandas
d = Data('my-large-file.csv')
d.my_column.count() # Uses Chunked Pandas
How does it work?
Blaze breaks up the data resource into a sequence of chunks. It pulls one chunk into memory, operates on it, pulls in the next, etc. After all chunks are processed it often has to finalize the computation with another operation on the intermediate results.
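The same chunk-process-finalize pattern can also be written by hand with pandas' own chunked CSV reader, without Blaze (the file and column names below are just placeholders):

import pandas as pd

total = 0
# pull one chunk into memory at a time, operate on it, then move to the next
for chunk in pd.read_csv('my-large-file.csv', chunksize=100_000):
    total += chunk['my_column'].count()

# finalize the computation from the intermediate per-chunk results
print(total)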
This is a snippet of my Data Frame in pandas
SubJob DetectorName CategoryID DefectID Image
0 0 NECK:1 79 5
1 0 NECK:2 79 6
2 0 NECK:3 92 4
3 0 NECK:4 99 123
4 0 NECK:5 99 124
5 1 NECK:6 79 47
6 1 NECK:7 91 631
7 1 NECK:8 98 646
8 1 NECK:9 99 7
9 2 NECK:10 79 15
10 2 NECK:11 89 1023
11 2 NECK:12 79 1040
12 2 NECK:13 79 2458
13 3 NECK:14 73 2459
14 3 NECK:15 87 2517
15 3 NECK:15 79 3117
16 3 NECK:16 79 3118
... till n, which is very large.
We have multiple SubJobs, which are sorted; inside each we have multiple CategoryIDs, which are sorted; and inside each CategoryID we have multiple DefectIDs, which are also sorted.
I have a separate nested list:
[[CategoryId, DefectId, Image-Link], [CategoryId, DefectId, Image-Link], ... m times]
where m is large. Here CategoryId and DefectId are integer values and the image link is a string.
Now I repeatedly pick a CategoryId, DefectId pair from the list, find the row in the dataframe corresponding to that CategoryId, DefectId, and add the image link to that row.
My current code is:
for image_info_list in final_image_info_list:
    # add path of image in Image_Link
    frame_main.loc[(frame_main["CategoryID"].values == image_info_list[0])
                   & (frame_main["DefectID"].values == image_info_list[1]),
                   "Image_Link"] = image_info_list[2]
This works correctly, but since n and m are very large it takes a lot of time to compute. Is there a more appropriate approach? Can I apply binary search here? If yes, then how?
For a fixed n, if m is large enough, you can perform queries more efficiently by some preprocessing.
(I would start with Idea 2 below, because Idea 1 is much more work to implement.)
Idea 1
First, sort the dataframe by [CategoryId, DefectId, Image-Link]. Following that, you can find any triplet by a triple application of a bisect algorithm, one per column, on the column's values.
The cost of what you're doing now is O(m n). The cost of my suggestion is O(n log(n) + m log(n)).
This will work better for some values of m and n, and worse for others. E.g., if m = Θ(n), then your current algorithm is Θ(n²) = ω(n log(n)). YMMV.
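A pandas-flavoured sketch of Idea 1 (an illustration using the names from the question's code, not the answerer's implementation): sorting once and indexing by the key columns replaces the full boolean scan per query with a fast index lookup.

# sort once and index by the lookup keys
indexed = (frame_main.sort_values(["CategoryID", "DefectID"])
                     .set_index(["CategoryID", "DefectID"]))
indexed["Image_Link"] = ""  # pre-create the target column

for cat_id, defect_id, link in final_image_info_list:
    if (cat_id, defect_id) in indexed.index:
        indexed.loc[(cat_id, defect_id), "Image_Link"] = link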
Idea 2
Since Image-link is a string sequence, I'm guessing pandas has a harder time searching for specific values within it. You can preprocess by making a dictionary mapping each value to a list of indices within the Dataframe. In the extreme case, where each Image-link value has O(1) rows, this can reduce the time from Θ(mn) to Θ(n + m).
Edit
In the extreme case the OP mentions in the comment, all Image-link values are unique. In this case, it is possible to build a dictionary mapping their values to indices like so:
{k: i for i, k in enumerate(df['Image-link'].values)}
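For the general case in the question, where the lookup keys are (CategoryID, DefectID) pairs rather than Image-link values, the same preprocessing idea might look like this sketch (column and variable names are taken from the question and assumed, not confirmed):

from collections import defaultdict

# build the dictionary once: (CategoryID, DefectID) -> list of row positions
key_to_rows = defaultdict(list)
for pos, key in enumerate(zip(frame_main["CategoryID"].values,
                              frame_main["DefectID"].values)):
    key_to_rows[key].append(pos)

# each update is now a dict lookup instead of a scan over all n rows
for cat_id, defect_id, link in final_image_info_list:
    rows = key_to_rows.get((cat_id, defect_id), [])
    frame_main.loc[frame_main.index[rows], "Image_Link"] = link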