How to join tables in Python without overwriting existing column data

I need to join multiple tables but I can't get the join in Python to behave as expected. I need to left join table 2 to table 1, without overwriting the existing data in the "geometry" column of table 1. What I'm trying to achieve is sort of like a VLOOKUP in Excel. I want to pull matching values from my other tables (~10) into table 1 without overwriting what is already there. Is there a better way? Below is what I tried:
TABLE 1
| ID | BLOCKCODE | GEOMETRY |
| -- | --------- | -------- |
| 1 | 123 | ABC |
| 2 | 456 | DEF |
| 3 | 789 | |
TABLE 2
| ID | GEOID | GEOMETRY |
| -- | ----- | -------- |
| 1 | 123 | |
| 2 | 456 | |
| 3 | 789 | GHI |
TABLE 3 (What I want)
| ID | BLOCKCODE | GEOID | GEOMETRY |
| -- | --------- |----- | -------- |
| 1 | 123 | 123 | ABC |
| 2 | 456 | 456 | DEF |
| 3 | | 789 | GHI |
What I'm getting
| ID | GEOID | GEOMETRY_X | GEOMETRY_Y |
| -- | ----- | -------- | --------- |
| 1 | 123 | ABC | |
| 2 | 456 | DEF | |
| 3 | 789 | | GHI |
join = pd.merge(table1, table2, how="left", left_on="BLOCKCODE", right_on="GEOID"
When I try this:
join = pd.merge(table1, table2, how="left", left_on=["BLOCKCODE", "GEOMETRY"], right_on=["GEOID", "GEOMETRY"]
I get table 1 back unchanged:
| ID | BLOCKCODE | GEOMETRY |
| -- | --------- | -------- |
| 1 | 123 | ABC |
| 2 | 456 | DEF |
| 3 | 789 | |

You could try:
# Rename the BLOCKCODE column in table1 so it matches table2's GEOID column.
# This is necessary because update() aligns on index and column names.
table1 = table1.rename(columns={"BLOCKCODE": "GEOID"})
# Fill table1 from table2: non-NaN values in table2 overwrite the matching
# cells in table1, and NaN values in table2 are ignored. Pass overwrite=False
# if you only want to fill the cells that are NaN in table1.
table1.update(table2)
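If you would rather keep the merge, here is a minimal sketch of an alternative that collapses the two suffixed GEOMETRY columns after the join, preferring what is already in table1 (this assumes empty geometries are stored as NaN; _x/_y are pandas' default suffixes, and the duplicated ID column is ignored for brevity):
import pandas as pd

joined = pd.merge(table1, table2, how="left", left_on="BLOCKCODE", right_on="GEOID")
# Keep table1's geometry where it exists, otherwise take table2's.
joined["GEOMETRY"] = joined["GEOMETRY_x"].fillna(joined["GEOMETRY_y"])
joined = joined.drop(columns=["GEOMETRY_x", "GEOMETRY_y"])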

Related

How to Convert Column into a List based on the other column in pyspark

I have a data frame in pyspark which is as follows:
| Column A | Column B |
| -------- | -------- |
| 123 | abc |
| 123 | def |
| 456 | klm |
| 789 | nop |
| 789 | qrst |
For every row in column A the column B has to be transformed into a list. The result should look like this.
| Column A | Column B |
| -------- | -------- |
| 123 |[abc,def] |
| 456 | [klm] |
| 789 |[nop,qrst]|
I have tried using map(), but it didn't give me the expected results. Can you point me in the right direction on how to approach this problem?
Use collect_list:
from pyspark.sql import functions as F

# Collect all Column B values for each Column A into a list.
df1.groupBy("Column A").agg(F.collect_list("Column B").alias("Column B")).show()
Output (row order and the order of values within each list may vary):
| Column A | Column B |
| -------- | ----------- |
| 123 | [abc, def] |
| 456 | [klm] |
| 789 | [nop, qrst] |

Dividing a dataframe into several dataframes according to date column

I have a dataframe which contains a date column called 'testdate', and I have a period between two specific dates, such as 20110501~20120731.
I want to divide that dataframe into multiple dataframes according to the year-month of 'testdate'.
For example, if 'testdate' falls within 20110501-20110531 the row goes into df1, if it falls within the next month, into df2, and so on.
For example, a whole dataframe looks like this...
| StudentID | Testdate | Record |
| -------- | -------- | ------ |
| 1 | 20110528 | 50 |
| 2 | 20110601 | 75 |
| 3 | 20110504 | 100 |
| 4 | 20110719 | 82 |
| 5 | 20111120 | 42 |
| 6 | 20111103 | 95 |
| 7 | 20120520 | 42 |
| 8 | 20120503 | 95 |
But, I want to divide it like this...
[DF1]: name should be 201105
| StudentID | Testdate | Record |
| -------- | -------- | ------ |
| 1 | 20110528 | 50 |
| 3 | 20110504 | 100 |
[DF2]: name should be 201106
| StudentID | Testdate | Record |
| -------- | -------- | ------ |
| 2 | 20110601 | 75 |
[DF3]
| StudentID | Testdate | Record |
| -------- | -------- | ------ |
| 4 | 20110719 | 82 |
[DF4]
| StudentID | Testdate | Record |
| -------- | -------- | ------ |
| 5 | 20111120 | 42 |
| 6 | 20111103 | 95 |
[DF5]
| StudentID | Testdate | Record |
| -------- | -------- | ------ |
| 7 | 20120520 | 42 |
| 8 | 20120503 | 95 |
I found some code for dividing a dataframe by quarter, but I couldn't find any code for my task.
How can I deal with it? Many thanks for your help.
Create a grouper by slicing the yyyymm part out of Testdate, then group the dataframe and store each group inside a dict comprehension:
# Slice the first six characters (yyyymm) of Testdate as the grouping key.
s = df['Testdate'].astype(str).str[:6]
dfs = {f'df_{k}': g for k, g in df.groupby(s)}
# dfs['df_201105']
StudentID Testdate Record
0 1 20110528 50
2 3 20110504 100
# dfs['df_201106']
StudentID Testdate Record
1 2 20110601 75
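Since Testdate is stored as a number in this example, integer division would work as an equivalent grouper, e.g. df.groupby(df['Testdate'] // 100), assuming the column really is an integer type.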

How to get the columns with null values in GridDB?

I have the GridDB Python client running on my Ubuntu computer. I would like to get the columns having null values using a GridDB query. I know it's possible to get the rows with null values, but this time I want the columns.
Take, for example, the timeseries table below:
| timestamp | value1 | value2 | value3 | output |
| ------------------- | ------ | ------ | ------ | ------ |
| 2021-06-24 12:00:22 | 1.3819 | 2.4214 | | 0 |
| 2021-06-25 11:55:23 | 4.8726 | 6.2324 | 9.3424 | 1 |
| 2021-06-26 05:40:53 | 6.1313 | | 5.4648 | 0 |
| 2021-06-27 08:24:19 | 6.7543 | | 9.7967 | 0 |
| 2021-06-28 13:34:51 | 3.5713 | 1.4452 | | 1 |
The solution should basically return the value2 and value3 columns. Thanks in advance!
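One way to approach this, sketched under the assumption that the container's rows have already been fetched into a pandas DataFrame (the query step is elided; the DataFrame below is a hypothetical stand-in for the fetched rows):
import pandas as pd

# Hypothetical stand-in for rows fetched from the GridDB container.
df = pd.DataFrame({
    "value1": [1.3819, 4.8726, 6.1313, 6.7543, 3.5713],
    "value2": [2.4214, 6.2324, None, None, 1.4452],
    "value3": [None, 9.3424, 5.4648, 9.7967, None],
    "output": [0, 1, 0, 0, 1],
})

# Columns that contain at least one null value.
null_columns = df.columns[df.isna().any()].tolist()
print(null_columns)  # ['value2', 'value3']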

Fuzzymatcher returns NaN for best_match_score

I'm observing odd behaviour while performing fuzzy_left_join from the fuzzymatcher library. When joining two dataframes, the left one with 5217 records and the right one with 8734, only 71 records end up with a best_match_score, which seems really odd. To get better results I even removed all the numbers and left only alphabetical characters in the joining columns. In the merged table the id column from the right table is NaN, which is also a strange result.
Left table - the join column is "amazon_s3_name"; the first item is limonig:
+------+---------+-------+-----------+------------------------------------+
| id | product | price | category | amazon_s3_name |
+------+---------+-------+-----------+------------------------------------+
| 1 | A | 1.49 | fruits | limonig |
| 8964 | B | 1.39 | beverages | studencajfuzelimonilimonetatrevaml |
| 9659 | C | 2.79 | beverages | studencajfuzelimonilimtreval |
+------+---------+-------+-----------+------------------------------------+
Right table - the join column is "amazon_s3_name"; the last item is limoni:
+------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+
| id | picture | amazon_s3_name |
+------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+
| 191 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/AhmadCajLimonIDjindjifil20X2G.jpg | ahmadcajlimonidjindjifilxg |
| 192 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/AhmadCajLimonIDjindjifil20X2G40g.jpg | ahmadcajlimonidjindjifilxgg |
| 204 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Ahmadcajlimonidjindjifil20x2g40g00051265.jpg | ahmadcajlimonidjindjifilxgg |
| 1608 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Cajstudenfuzetealimonilimonovatreva15lpet.jpg | cajstudenfuzetealimonilimonovatrevalpet |
| 4689 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Lesieursalatensosslimonimaslinovomaslo.jpg | lesieursalatensosslimonimaslinovomaslo |
| 4690 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Lesieursalatensosslimonimaslinovomaslo05l500ml01301150.jpg | lesieursalatensosslimonimaslinovomaslolml |
| 4723 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Limoni.jpg | limoni |
+------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+
Merged table - as we can see, best_match_score is NaN:
+----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+
| id | best_match_score | __id_left | __id_right | price | category | amazon_s3_name_left | image_left | amazon_s3_name_left | image_right | amazon_s3_name_right |
+----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+
| 0 | NaN | 0_left | None | 1.49 | Fruits | Limoni500g09700112 | NaN | limonig | NaN | NaN |
| 2 | NaN | 2_left | None | 1.69 | Bio | Morkovi1kgbr09700132 | NaN | morkovikgbr | NaN | NaN |
+----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+
You could give polyfuzz a try. Use the setup from the library's examples, for instance a TF-IDF or BERT based matcher, then run something like this (the TFIDF model below is one of the documented choices):
from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF

# Match every left-hand name against the right-hand names.
model = PolyFuzz(TFIDF(n_gram_range=(3, 3))).match(df1["amazon_s3_name"].tolist(), df2["amazon_s3_name"].tolist())
df1["To"] = model.get_matches()["To"]
then merge:
df1.merge(df2, left_on='To', right_on='amazon_s3_name')
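If the matches look weak, get_matches() also returns a Similarity column, which you can filter on before merging so that poor matches don't end up in the join.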

Replace Dataframe value in row with the value of the row below

I have the following frame
+-------+--------+-----+
|   1   |   2    |  3  |
+-------+--------+-----+
| hi    | banana | 123 |
|       | apple  |     |
| hello | pear   | 456 |
|       | orange |     |
+-------+--------+-----+
What is the most pythonic way of replacing the value in column 2 for each odd row with the value from the row below, i.e. ending up with a df like this:
+-------+--------+-----+
| 1 | 2 | 3 |
+-------+--------+-----+
| hi | apple | 123 |
| hello | orange | 456 |
+-------+--------+-----+
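A minimal sketch of one way to do this, assuming the rows alternate strictly and the frame uses the default integer index (the literal frame below just mirrors the example):
import pandas as pd

df = pd.DataFrame({
    "1": ["hi", "", "hello", ""],
    "2": ["banana", "apple", "pear", "orange"],
    "3": [123, None, 456, None],
})

# Shift column 2 up one row so each odd row's value lands on the row above,
# then keep every second row and reindex.
df["2"] = df["2"].shift(-1)
result = df.iloc[::2].reset_index(drop=True)
print(result)  # rows: hi/apple/123 and hello/orange/456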
