I have the GridDB Python client running on my Ubuntu machine, and I would like to get the columns that contain null values using a GridDB query. I know it's possible to get the rows with null values, but this time I want the columns.
Take, for example, the time-series table below:
| timestamp           | value1 | value2 | value3 | output |
|---------------------|--------|--------|--------|--------|
| 2021-06-24 12:00:22 | 1.3819 | 2.4214 |        | 0      |
| 2021-06-25 11:55:23 | 4.8726 | 6.2324 | 9.3424 | 1      |
| 2021-06-26 05:40:53 | 6.1313 |        | 5.4648 | 0      |
| 2021-06-27 08:24:19 | 6.7543 |        | 9.7967 | 0      |
| 2021-06-28 13:34:51 | 3.5713 | 1.4452 |        | 1      |
The solution should return the value2 and value3 columns. Thanks in advance!
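One way, once the rows have been fetched into pandas via the GridDB Python client's query/fetch loop, is to let pandas report the columns with nulls on the client side (a sketch, not a pure query-language solution; the DataFrame below reproduces the sample table):
import pandas as pd
import numpy as np

# The sample time-series table above, reproduced as a DataFrame.
df = pd.DataFrame({'value1': [1.3819, 4.8726, 6.1313, 6.7543, 3.5713],
                   'value2': [2.4214, 6.2324, np.nan, np.nan, 1.4452],
                   'value3': [np.nan, 9.3424, 5.4648, 9.7967, np.nan],
                   'output': [0, 1, 0, 0, 1]})

# Columns that contain at least one null value.
print(df.columns[df.isnull().any()].tolist())  # ['value2', 'value3']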
I need to join multiple tables, but I can't get the join in Python to behave as expected. I need to left join table 2 onto table 1 without overwriting the existing data in the "geometry" column of table 1. What I'm trying to achieve is something like a VLOOKUP in Excel: I want to pull matching values from my other tables (~10 of them) into table 1 without overwriting what is already there. Is there a better way? Below is what I tried:
TABLE 1
| ID | BLOCKCODE | GEOMETRY |
| -- | --------- | -------- |
| 1 | 123 | ABC |
| 2 | 456 | DEF |
| 3 | 789 | |
TABLE 2
| ID | GEOID | GEOMETRY |
| -- | ----- | -------- |
| 1 | 123 | |
| 2 | 456 | |
| 3 | 789 | GHI |
TABLE 3 (What I want)
| ID | BLOCKCODE | GEOID | GEOMETRY |
| -- | --------- |----- | -------- |
| 1 | 123 | 123 | ABC |
| 2 | 456 | 456 | DEF |
| 3 | | 789 | GHI |
What I'm getting
| ID | GEOID | GEOMETRY_X | GEOMETRY_Y |
| -- | ----- | -------- | --------- |
| 1 | 123 | ABC | |
| 2 | 456 | DEF | |
| 3 | 789 | | GHI |
join = pd.merge(table1, table2, how="left", left_on="BLOCKCODE", right_on="GEOID")
When I try this:
join = pd.merge(table1, table2, how="left", left_on=["BLOCKCODE", "GEOMETRY"], right_on=["GEOID", "GEOMETRY"])
I get this:
TABLE 1 (unchanged)
| ID | BLOCKCODE | GEOMETRY |
| -- | --------- | -------- |
| 1 | 123 | ABC |
| 2 | 456 | DEF |
| 3 | 789 | |
You could try:
# Rename the BLOCKCODE column in table1 so it matches table2's GEOID column.
# This is necessary for the next step to align the columns.
table1 = table1.rename(columns={"BLOCKCODE": "GEOID"})
# Fill the NaN values in table1 with the corresponding values from table2;
# overwrite=False leaves table1's existing (non-NaN) values untouched.
table1.update(table2, overwrite=False)
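For illustration, a minimal end-to-end sketch with the toy tables from the question (ID is set as the index so that update() can align the rows):
import pandas as pd
import numpy as np

table1 = pd.DataFrame({"ID": [1, 2, 3],
                       "GEOID": [123, 456, 789],  # renamed from BLOCKCODE
                       "GEOMETRY": ["ABC", "DEF", np.nan]}).set_index("ID")
table2 = pd.DataFrame({"ID": [1, 2, 3],
                       "GEOID": [123, 456, 789],
                       "GEOMETRY": [np.nan, np.nan, "GHI"]}).set_index("ID")

table1.update(table2, overwrite=False)  # only the missing GEOMETRY is filled
print(table1)  # ID 3 gains 'GHI'; 'ABC' and 'DEF' are untouched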
I'm observing odd behaviour while performing a fuzzy_left_join from the fuzzymatcher library. Trying to join two DataFrames, the left one with 5217 records and the right one with 8734, only 71 records end up with a best_match_score, which seems really odd. To get better results I even removed all the numbers and left only alphabetic characters in the joining columns. In the merged table the id column from the right table is NaN, which is also a strange result.
left table - column for join "amazon_s3_name". First item - limonig
+------+---------+-------+-----------+------------------------------------+
| id | product | price | category | amazon_s3_name |
+------+---------+-------+-----------+------------------------------------+
| 1 | A | 1.49 | fruits | limonig |
| 8964 | B | 1.39 | beverages | studencajfuzelimonilimonetatrevaml |
| 9659 | C | 2.79 | beverages | studencajfuzelimonilimtreval |
+------+---------+-------+-----------+------------------------------------+
right table - column for join "amazon_s3_name" - last item - limoni
+------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+
| id | picture | amazon_s3_name |
+------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+
| 191 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/AhmadCajLimonIDjindjifil20X2G.jpg | ahmadcajlimonidjindjifilxg |
| 192 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/AhmadCajLimonIDjindjifil20X2G40g.jpg | ahmadcajlimonidjindjifilxgg |
| 204 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Ahmadcajlimonidjindjifil20x2g40g00051265.jpg | ahmadcajlimonidjindjifilxgg |
| 1608 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Cajstudenfuzetealimonilimonovatreva15lpet.jpg | cajstudenfuzetealimonilimonovatrevalpet |
| 4689 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Lesieursalatensosslimonimaslinovomaslo.jpg | lesieursalatensosslimonimaslinovomaslo |
| 4690 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Lesieursalatensosslimonimaslinovomaslo05l500ml01301150.jpg | lesieursalatensosslimonimaslinovomaslolml |
| 4723 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Limoni.jpg | limoni |
+------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+
merged table - as we can see, the best_match_score is NaN
+----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+
| id | best_match_score | __id_left | __id_right | price | category | amazon_s3_name_left | image_left | amazon_s3_name_left | image_right | amazon_s3_name_right |
+----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+
| 0 | NaN | 0_left | None | 1.49 | Fruits | Limoni500g09700112 | NaN | limonig | NaN | NaN |
| 2 | NaN | 2_left | None | 1.69 | Bio | Morkovi1kgbr09700132 | NaN | morkovikgbr | NaN | NaN |
+----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+
You could give polyfuzz a try. Use the setup from the library's examples, e.g. TF-IDF or BERT, then run:
model = PolyFuzz(matchers).match(df1["amazon_s3_name"].tolist(), df2["amazon_s3_name"].tolist())
df1['To'] = model.get_matches()['To']
then merge:
merged = df1.merge(df2, left_on='To', right_on='amazon_s3_name')
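For reference, a minimal self-contained sketch of the above, instantiating PolyFuzz with the built-in "TF-IDF" method string (the matchers list from the docs isn't shown here, so treat that part as an assumption about the setup):
from polyfuzz import PolyFuzz

# Match every left-hand name against its closest right-hand name.
model = PolyFuzz("TF-IDF")
model.match(df1["amazon_s3_name"].tolist(), df2["amazon_s3_name"].tolist())
df1["To"] = model.get_matches()["To"]

# Then join on the matched string.
merged = df1.merge(df2, left_on="To", right_on="amazon_s3_name")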
I'm trying to create a new column whose value is based on two index levels of each row. I have two DataFrames with equivalent MultiIndexes on the levels I'm querying (but not of equal size). For each row in the first DataFrame, I want the value from the second DataFrame that matches that row's indices.
I originally thought I could use .loc[] and filter on the index values, but I cannot seem to get this to change the output row by row. If I weren't using a DataFrame, I'd just loop over the whole thing.
I have tried the .apply() method, but I can't figure out what function to pass to it.
Creating some toy data with the same structure:
import pandas as pd
import numpy as np

np.random.seed(1)  # note: np.random.seed = 1 would silently replace the function
df = pd.DataFrame({'Aircraft': np.ones(15),
                   'DC': np.append(np.repeat(['A', 'B'], 7), 'C'),
                   'Test': np.array([10, 10, 10, 10, 10, 10, 20, 10, 10, 10, 10, 10, 10, 20, 10]),
                   'Record': np.array([1, 2, 3, 4, 5, 6, 1, 1, 2, 3, 4, 5, 6, 1, 1]),
                   # There are multiple "value" columns in my data, simplified here
                   'Value': np.random.random(15)
                   })
df.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
df.sort_index(inplace=True)
v = pd.DataFrame({'Aircraft': np.ones(7),
                  'DC': np.repeat('v', 7),
                  'Test': np.array([10, 10, 10, 10, 10, 10, 20]),
                  'Record': np.array([1, 2, 3, 4, 5, 6, 1]),
                  'Value': np.random.random(7)
                  })
v.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
v.sort_index(inplace=True)
df['v'] = df.apply(lambda x: v.loc[df.iloc[x]])
This returns an error for indexing on a MultiIndex.
To set all values to a single "v" value:
df['v'] = float(v.loc[(slice(None), 'v', 10, 1), 'Value'])
So inputs look like this:
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | A | 10 | 1 | 0.847576 |
| | | | 2 | 0.860720 |
| | | | 3 | 0.017704 |
| | | | 4 | 0.082040 |
| | | | 5 | 0.583630 |
| | | | 6 | 0.506363 |
| | | 20 | 1 | 0.844716 |
| | B | 10 | 1 | 0.698131 |
| | | | 2 | 0.112444 |
| | | | 3 | 0.718316 |
| | | | 4 | 0.797613 |
| | | | 5 | 0.129207 |
| | | | 6 | 0.861329 |
| | | 20 | 1 | 0.535628 |
| | C | 10 | 1 | 0.121704 |
--------------------------------------------
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | v | 10 | 1 | 0.961791 |
| | | | 2 | 0.046681 |
| | | | 3 | 0.913453 |
| | | | 4 | 0.495924 |
| | | | 5 | 0.149950 |
| | | | 6 | 0.708635 |
| | | 20 | 1 | 0.874841 |
--------------------------------------------
And after the operation, I want this:
| Aircraft | DC | Test | Record | Value | v |
|----------|----|------|--------|----------|----------|
| 1.0 | A | 10 | 1 | 0.847576 | 0.961791 |
| | | | 2 | 0.860720 | 0.046681 |
| | | | 3 | 0.017704 | 0.913453 |
| | | | 4 | 0.082040 | 0.495924 |
| | | | 5 | 0.583630 | 0.149950 |
| | | | 6 | 0.506363 | 0.708635 |
| | | 20 | 1 | 0.844716 | 0.874841 |
| | B | 10 | 1 | 0.698131 | 0.961791 |
| | | | 2 | 0.112444 | 0.046681 |
| | | | 3 | 0.718316 | 0.913453 |
| | | | 4 | 0.797613 | 0.495924 |
| | | | 5 | 0.129207 | 0.149950 |
| | | | 6 | 0.861329 | 0.708635 |
| | | 20 | 1 | 0.535628 | 0.874841 |
| | C | 10 | 1 | 0.121704 | 0.961791 |
Edit: since you are on pandas 0.23.4 (droplevel was only added in 0.24), just change droplevel to reset_index with drop=True:
df_result = (df.reset_index('DC').assign(v=v.reset_index('DC', drop=True))
               .set_index('DC', append=True)
               .reorder_levels(v.index.names))
Original: one way is to move the DC index level of df into the columns, use assign to create the new column, then set DC back as an index and reorder the levels:
df_result = (df.reset_index('DC').assign(v=v.droplevel('DC'))
               .set_index('DC', append=True)
               .reorder_levels(v.index.names))
Out[1588]:
Value v
Aircraft DC Test Record
1.0 A 10 1 0.847576 0.961791
2 0.860720 0.046681
3 0.017704 0.913453
4 0.082040 0.495924
5 0.583630 0.149950
6 0.506363 0.708635
20 1 0.844716 0.874841
B 10 1 0.698131 0.961791
2 0.112444 0.046681
3 0.718316 0.913453
4 0.797613 0.495924
5 0.129207 0.149950
6 0.861329 0.708635
20 1 0.535628 0.874841
C 10 1 0.121704 0.961791
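On pandas 0.24 or later you could also skip the level juggling and look the values up directly. A sketch, assuming droplevel and to_numpy are available:
# Drop the mismatched DC level from both sides, then align df's remaining
# (Aircraft, Test, Record) labels against v's unique labels.
lookup = v['Value'].droplevel('DC')
df['v'] = lookup.loc[df.index.droplevel('DC')].to_numpy()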
Say I have the following table:
+----+---------+--------+---------+---------+---------+---------+-----------+-----------+-----------+----------+-----------+------------+------------+---------+---+
| 1 | 0.72694 | 1.4742 | 0.32396 | 0.98535 | 1 | 0.83592 | 0.0046566 | 0.0039465 | 0.04779 | 0.12795 | 0.016108 | 0.0052323 | 0.00027477 | 1.1756 | 1 |
| 2 | 0.74173 | 1.5257 | 0.36116 | 0.98152 | 0.99825 | 0.79867 | 0.0052423 | 0.0050016 | 0.02416 | 0.090476 | 0.0081195 | 0.002708 | 7.48E-05 | 0.69659 | 1 |
| 3 | 0.76722 | 1.5725 | 0.38998 | 0.97755 | 1 | 0.80812 | 0.0074573 | 0.010121 | 0.011897 | 0.057445 | 0.0032891 | 0.00092068 | 3.79E-05 | 0.44348 | 1 |
| 4 | 0.73797 | 1.4597 | 0.35376 | 0.97566 | 1 | 0.81697 | 0.0068768 | 0.0086068 | 0.01595 | 0.065491 | 0.0042707 | 0.0011544 | 6.63E-05 | 0.58785 | 1 |
| 5 | 0.82301 | 1.7707 | 0.44462 | 0.97698 | 1 | 0.75493 | 0.007428 | 0.010042 | 0.0079379 | 0.045339 | 0.0020514 | 0.00055986 | 2.35E-05 | 0.34214 | 1 |
| 7 | 0.82063 | 1.7529 | 0.44458 | 0.97964 | 0.99649 | 0.7677 | 0.0059279 | 0.0063954 | 0.018375 | 0.080587 | 0.0064523 | 0.0022713 | 4.15E-05 | 0.53904 | 1 |
| 8 | 0.77982 | 1.6215 | 0.39222 | 0.98512 | 0.99825 | 0.80816 | 0.0050987 | 0.0047314 | 0.024875 | 0.089686 | 0.0079794 | 0.0024664 | 0.00014676 | 0.66975 | 1 |
| 9 | 0.83089 | 1.8199 | 0.45693 | 0.9824 | 1 | 0.77106 | 0.0060055 | 0.006564 | 0.0072447 | 0.040616 | 0.0016469 | 0.00038812 | 3.29E-05 | 0.33696 | 1 |
| 11 | 0.7459 | 1.4927 | 0.34116 | 0.98296 | 1 | 0.83088 | 0.0055665 | 0.0056395 | 0.0057679 | 0.036511 | 0.0013313 | 0.00030872 | 3.18E-05 | 0.25026 | 1 |
| 12 | 0.79606 | 1.6934 | 0.43387 | 0.98181 | 1 | 0.76985 | 0.0077992 | 0.011071 | 0.013677 | 0.057832 | 0.0033334 | 0.00081648 | 0.00013855 | 0.49751 | 1 |
+----+---------+--------+---------+---------+---------+---------+-----------+-----------+-----------+----------+-----------+------------+------------+---------+---+
I have two sets of row indices:
set1 = [1,3,5,8,9]
set2 = [2,4,7,10,10]
Note: here I have indicated the first row with index value 1. The lengths of both sets will always be the same.
What I am looking for is a fast and pythonic way to get the difference of the column values for corresponding row indices, that is, the differences 1-2, 3-4, 5-7, 8-10, 9-10.
For this example, my resultant dataframe is the following:
+---+---------+--------+---------+---------+---------+---------+-----------+-----------+-----------+----------+-----------+------------+------------+---------+---+
| 1 | 0.01479 | 0.0515 | 0.0372 | 0.00383 | 0.00175 | 0.03725 | 0.0005857 | 0.0010551 | 0.02363 | 0.037474 | 0.0079885 | 0.0025243 | 0.00019997 | 0.47901 | 0 |
| 1 | 0.02925 | 0.1128 | 0.03622 | 0.00189 | 0 | 0.00885 | 0.0005805 | 0.0015142 | 0.004053 | 0.008046 | 0.0009816 | 0.00023372 | 0.0000284 | 0.14437 | 0 |
| 3 | 0.04319 | 0.1492 | 0.0524 | 0.00814 | 0.00175 | 0.05323 | 0.0023293 | 0.0053106 | 0.0169371 | 0.044347 | 0.005928 | 0.00190654 | 0.00012326 | 0.32761 | 0 |
| 3 | 0.03483 | 0.1265 | 0.02306 | 0.00059 | 0 | 0.00121 | 0.0017937 | 0.004507 | 0.0064323 | 0.017216 | 0.0016865 | 0.00042836 | 0.00010565 | 0.16055 | 0 |
| 1 | 0.05016 | 0.2007 | 0.09271 | 0.00115 | 0 | 0.06103 | 0.0022327 | 0.0054315 | 0.0079091 | 0.021321 | 0.0020021 | 0.00050776 | 0.00010675 | 0.24725 | 0 |
+---+---------+--------+---------+---------+---------+---------+-----------+-----------+-----------+----------+-----------+------------+------------+---------+---+
My resultant difference values here are absolute.
I can't apply diff(), since the row indices may not be consecutive.
I am currently achieving my aim by looping through the sets.
Is there a pandas trick to do this?
Use loc-based indexing:
df.loc[set1].values - df.loc[set2].values
Ensure that len(set1) equals len(set2). Also keep in mind that setX is a counter-intuitive name for list objects.
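Since your expected output holds absolute differences, you can wrap the subtraction in numpy's abs and rebuild a DataFrame, e.g. (a sketch, assuming every label in set1 and set2 exists in the index):
import numpy as np
import pandas as pd

diff = pd.DataFrame(np.abs(df.loc[set1].values - df.loc[set2].values),
                    columns=df.columns)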
You can select by reindexing and then subtract:
df = df.reindex(set1) - df.reindex(set2).values
loc or iloc will raise a FutureWarning here, since passing list-likes to .loc or [] with any missing label will raise a KeyError in the future.
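If you also want the absolute values shown in your expected output, chain .abs() onto the reindex approach (a small extension of the above):
result = (df.reindex(set1) - df.reindex(set2).values).abs()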
In short, try the following:
df.iloc[::2].values - df.iloc[1::2].values
PS: or alternatively, if (as in your question) the indices follow no simple rule:
df.iloc[set1].values - df.iloc[set2].values
Note that .iloc is positional (0-based), unlike the label-based .loc used above.