I have a dataframe with the following data:
+----------+------------+-------------+---------------+----------+
|id |name |predicted |actual |yyyy_mm_dd|
+----------+------------+-------------+---------------+----------+
| 215| NirPost| null|100.10023 |2020-01-10|
| null| NirPost| 57145|null |2020-01-10|
+----------+------------+-------------+---------------+----------+
I want to merge these two rows into one, based on the name. This df is the result of a query that I've restricted to one company and a single day. In the real dataset, there are ~70 companies with daily data. I want to write this data into a new table as single rows.
This is the output I'd like:
+----------+------------+-------------+---------------+----------+
|id |name |predicted | actual |yyyy_mm_dd|
+----------+------------+-------------+---------------+----------+
| 215| NirPost| 57145 |100.10023 |2020-01-10|
+----------+------------+-------------+---------------+----------+
I've tried this:
df.replace('null','').groupby('name',as_index=False).agg(''.join)
However, this outputs my original df but with NaN instead of null.
`df.dtypes`:
id float64
name object
predicted float64
actual float64
yyyy_mm_dd object
dtype: object
How about explicitly passing all the columns to the groupby aggregation with max, so that it eliminates the null values?
import pandas as pd
import numpy as np

data = {'id': [215, np.nan], 'name': ['nirpost', 'nirpost'],
        'predicted': [np.nan, 57145], 'actual': [100.12, np.nan],
        'yyyy_mm_dd': ['2020-01-10', '2020-01-10']}
df = pd.DataFrame(data)
# max skips NaN, so each group keeps its non-null value per column
df = df.groupby('name').agg({'id': 'max', 'predicted': 'max',
                             'actual': 'max', 'yyyy_mm_dd': 'max'}).reset_index()
print(df)
Returns:
name id predicted actual yyyy_mm_dd
0 nirpost 215.0 57145.0 100.12 2020-01-10
Of course, since you have more data, you should probably add something else to your groupby (for example the date column) so that you don't collapse too many rows, but for the example data you provided, I believe this solves the issue.
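For instance, a minimal sketch (under the assumption that you want one row per company per day) that groups by both name and yyyy_mm_dd, so rows from different days or companies are never merged together:

df = (df.groupby(['name', 'yyyy_mm_dd'], as_index=False)
        .agg({'id': 'max', 'predicted': 'max', 'actual': 'max'}))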
EDIT:
If all columns end up being named original_column_name_max, then you can simply use this:
df.columns = [x[:-4] for x in list(df)]
The list comprehension strips the last 4 characters (that is, the _max suffix) from each value in list(df), which is the list of column names, and then assigns the result back with df.columns =.
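Alternatively (a sketch, assuming pandas 0.25+), named aggregation lets you pick the output column names up front, so no suffix stripping is needed:

df = (df.groupby('name', as_index=False)
        .agg(id=('id', 'max'),
             predicted=('predicted', 'max'),
             actual=('actual', 'max'),
             yyyy_mm_dd=('yyyy_mm_dd', 'max')))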
Related
I have a lot of datasets that I need to iterate through, search for a specific value, and return some values based on the search outcome.
The datasets are stored in a dictionary:
key type size Value
df1 DataFrame (89,10) Column names:
df2 DataFrame (89,10) Column names:
df3 DataFrame (89,10) Column names:
Each dataset looks something like this, and I am trying to check whether the value in column A, row 1 contains 035 and, if so, return column B.
| A | B | C
02 la 035 nan nan
Target 7 5
Warning 3 6
If I try to search for a specific value in it, I get an error:
TypeError: first argument must be string or compiled pattern
I have tried:
something = []
for key in df:
    text = df[key]
    if re.search('035', text):
        something.append(text['B'])
Something = pd.concat([something], axis=1)
You can use .str.contains() : https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html
df = pd.DataFrame({
    "A": ["02 la 035", "Target", "Warning"],
    "B": [0, 7, 3],
    "C": [0, 5, 6]
})
df[df["A"].str.contains("035")] # Returns first row only.
Also works for regex.
df[df["A"].str.contains("[TW]ar")] # Returns last two rows.
EDIT to answer additional details.
To extract column B for those rows which match the last regex pattern I used, amend the last line of code to:
df[df["A"].str.contains("[TW]ar")]["B"]
This returns a pandas Series with the matching values from column B.
Edit 2: I see you want a dataframe at the end. Just use:
df[df["A"].str.contains("[TW]ar")]["B"].to_frame()
I have a dataframe, df, with a standard wide format:
df:
'state' | 'population' | 'region'
0 'CA' | 10000 | 'west'
1 'UT' | 6000 | 'west'
2 'NY' | 8500 | 'east'
I need to be able to rename certain values in the state column that match some conditions I've set. For example, I need to rename cases of 'NY' to 'New York' if the region variable matches 'east'. I'd like to avoid slicing and concatenating the dataframe back together.
I've tried subsetting the dataframe using the code below, but the rename doesn't seem to apply properly.
region_filter = df['region'] == 'east'
df[region_filter] = df.loc[region_filter, 'state'].rename({'NY': 'New York'})
Rename should only be applied when trying to change axes labels. Pandas' replace() function is meant for mapping dataframe values.
Also, line two should read df.loc[region_filter, 'state'] = df.loc[region_filter, 'state'].rename({'NY': 'New York'}) to avoid a shape mismatch error.
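Putting the two points together, a minimal sketch using replace() on the filtered slice (recreating the df from the question):

import pandas as pd

df = pd.DataFrame({'state': ['CA', 'UT', 'NY'],
                   'population': [10000, 6000, 8500],
                   'region': ['west', 'west', 'east']})

region_filter = df['region'] == 'east'
# map 'NY' to 'New York' only where the region is 'east'
df.loc[region_filter, 'state'] = df.loc[region_filter, 'state'].replace({'NY': 'New York'})
print(df)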
How can we apply conditions to a dataset in Python, specifically in order to fetch a column as the output?
Let's say the dataframe below is what we have; my question is how to retrieve a column (let's say "name") as the output by applying conditions to this dataframe.
name salary jobDes
______________________________________
store1 | daniel | 50k | datascientist
store2 | paladugu | 55k | datascientist
store3 | theodor | 53k | dataEngineer
I want to fetch a column, let's say "name", as the result.
Elaborated:
import pandas as pd
data = {'name':['daniel', 'paladugu', 'theodor'], 'jobDes':['datascientist', 'datascientist', 'dataEngineer']}
df = pd.DataFrame(data)
print(df['name']) # just that easy
OUTPUT:
0 daniel
1 paladugu
2 theodor
Name: name, dtype: object
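If you also want to apply a condition and only fetch the matching values of the name column, boolean indexing is a minimal sketch (the condition on jobDes here is just an illustrative assumption):

# names of everyone whose job description is 'datascientist'
print(df.loc[df['jobDes'] == 'datascientist', 'name'])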
Presuming you are using either pandas or dask, you should be able to get a list of column names with
df.columns.
This means that if you wish to know what the first column is called, you can index it as usual with df.columns[0], etc. (indices start at 0 for the first element, as in Python generally).
If you then wish to access all the data in it, you can use
df[df.columns[0]] or the actual column name df['name'].
If your data frame is named df, df.columns returns all of the column names (as an Index, which you can treat like a list).
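For example, a quick sketch of both access patterns (reusing the df built above):

print(df.columns)         # Index(['name', 'jobDes'], dtype='object')
print(df.columns[0])      # name
print(df[df.columns[0]])  # same as df['name']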
In PySpark, I am trying to clean a dataset. Some of the columns have unwanted characters (=" ") in their values. I read the dataset as a DataFrame and I already created a user-defined function (UDF) which can remove the characters successfully, but now I am struggling to write a script which can identify on which columns I need to apply that function. I only use the last row of the dataset, assuming the columns always contain similar entries.
DataFrame (df):
id value1 value2 value3
="100010" 10 20 ="30"
In Python, the following works:
columns_to_fix = []
for col in df:
    value = df[col][0]
    if type(value) == str and value.startswith('='):
        columns_to_fix.append(col)
I tried the following in PySpark, but this returns all the column names:
columns_to_fix = []
for x in df.columns:
    if df[x].like('%="'):
        columns_to_fix.append(x)
Desired output:
columns_to_fix: ['id', 'value3']
Once I have the column names in a list, I can use a for loop to fix the entries in those columns. I am very new to PySpark, so my apologies if this is too basic a question. Thank you so much in advance for your advice!
"I only use the last row of the dataset, assuming the columns always contains similar entries." Under that assumption, you could collect a single row and test if the character you are looking for is in there.
Also, note that you do not need a udf to replace = in your columns, you can use regexp_replace. A working example is given below, hope this helps!
import pyspark.sql.functions as F
df = spark.createDataFrame([['=123','456','789'], ['=456','789','123']], ['a', 'b','c'])
df.show()
# +----+---+---+
# | a| b| c|
# +----+---+---+
# |=123|456|789|
# |=456|789|123|
# +----+---+---+
# list all columns with '=' in it.
row = df.limit(1).collect()[0].asDict()
columns_to_replace = [i for i,j in row.items() if '=' in j]
for col in columns_to_replace:
    df = df.withColumn(col, F.regexp_replace(col, '=', ''))
df.show()
# +---+---+---+
# | a| b| c|
# +---+---+---+
# |123|456|789|
# |456|789|123|
# +---+---+---+
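If the values are wrapped in the full ="..." pattern shown in the question (e.g. ="100010"), a single regexp_replace per affected column can strip the whole wrapper; a sketch under that assumption:

for col in columns_to_replace:
    # drop a leading =" and a trailing " if present, keeping only the inner value
    df = df.withColumn(col, F.regexp_replace(col, '^="|"$', ''))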
In PySpark, I have a dataframe with 10 columns like this:
id, last_name, first_name, manager, shop, location, manager, place, country, status
I would like to set a default value for only the first manager column. I've tried:
df.withColumn("manager", "x1")
but it gives me an error for an ambiguous reference, as there are 2 columns with the same name.
Is there a way to do it without renaming the column ?
One workaround is to recreate the dataframe with new column names. It's always better to have unique column names.
>>> df = spark.createDataFrame([('value1','value2'),('value3','value4')],['manager','manager'])
>>> df.show()
+-------+-------+
|manager|manager|
+-------+-------+
| value1| value2|
| value3| value4|
+-------+-------+
>>> df1 = df.toDF('manager1','manager2')
>>> from pyspark.sql.functions import lit
>>> df1.withColumn('manager1',lit('x1')).show()
+--------+--------+
|manager1|manager2|
+--------+--------+
| x1| value2|
| x1| value4|
+--------+--------+
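If downstream code still expects the original (duplicate) column names, one option is to set the value first and then rename back by position with toDF(*df.columns) — a sketch reusing the df above:

>>> df_fixed = df.toDF('manager1', 'manager2').withColumn('manager1', lit('x1')).toDF(*df.columns)
>>> df_fixed.columns
['manager', 'manager']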