I want to fill the 'references' column in df_out with the 'ID' values from df_sp whose corresponding 'my_ID' is contained in df_jira's 'reference_ids'.
import pandas as pd

# SharePoint side: existing items with their SharePoint-assigned IDs
d_sp = {'ID': [1, 2, 3, 4], 'my_ID': ["my_123", "my_234", "my_345", "my_456"], 'references': ["", "", "2", ""]}
df_sp = pd.DataFrame(data=d_sp)

# Jira side: new items and the my_IDs they reference
d_jira = {'my_ID': ["my_124", "my_235", "my_346"], 'reference_ids': ["my_123, my_234", "", "my_345"]}
df_jira = pd.DataFrame(data=d_jira)

# Keep only the Jira rows that are not yet in SharePoint
df_new = df_jira[~df_jira["my_ID"].isin(df_sp["my_ID"])].copy()

# Start df_out with the SharePoint columns and copy over the shared ones
df_out = pd.DataFrame(columns=df_sp.columns)
needed_cols = list(set(df_sp.columns).intersection(df_new.columns))
for column in needed_cols:
    df_out[column] = df_new[column]
df_out['Related elements_my'] = df_jira['reference_ids']
Desired output df_out:
| ID | my_ID | references |
|----|-------|------------|
| | my_124| 1, 2 |
| | my_235| |
| | my_346| 3 |
What I tried so far is a list comprehension, but I only managed to get the reference_ids "copied" from a helper column into my 'references' column with this:
for row, entry in df_out.iterrows():
    cpl_ids = [x for x in entry['Related elements_my'].split(', ') if any(vh_id == x for vh_id in df_cpl_list['my-ID'])]
    df_out.at[row, 'Related elements'] = ', '.join(cpl_ids)
I cannot wrap my head around how to get the specific 'ID's for the matches of any(), or whether this is the right approach at all, since I need all the matches, not just whether there is any match.
Any hints are appreciated!
I work with Python 3.9.4 on Windows (adding this in case Python 3.10 has any other solution).
Backstory: Moving data from Jira to MS SharePoint lists. (Therefore, the 'ID' does not equal the actual index in the dataframe but is assigned by SharePoint upon insertion into the list; hence it is empty after running the above for the new entries.)
# Lookup table mapping my_ID -> ID
ref_df = df_sp[["ID", "my_ID"]].set_index("my_ID")
# Split each reference_ids string and map every referenced my_ID to its ID
df_out.references = df_out["Related elements_my"].apply(
    lambda x: ",".join(map(lambda y: "" if y == "" else str(ref_df.loc[y.strip()].ID),
                           x.split(","))))
df_out[["ID", "my_ID", "references"]]
output:
ID my_ID references
0 NaN my_124 1,2
1 NaN my_235
2 NaN my_346 3
What is map?
map(func, lst) applies func to every element of lst, much like the list comprehension [func(i) for i in lst], but it returns a lazy iterator and is often faster.
You can read more about it here: https://realpython.com/python-map-function/
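For instance, a minimal illustration of the equivalence:
nums = ["1", "2", "3"]
print(list(map(int, nums)))    # [1, 2, 3]
print([int(n) for n in nums])  # the equivalent list comprehension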
Here, our function is: lambda y: "" if y == "" else str(ref_df.loc[y.strip()].ID)
So if y is empty (y.strip() is only there to remove surrounding spaces), it maps to the empty string ("" if y == ""), as for my_235, whose reference_ids is empty.
Otherwise it locates y in ref_df and gets the corresponding ID, i.e. it maps each my_ID to its ID.
Hope this is helpful :)
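As a variant of the same idea, a minimal sketch using a plain dict as the lookup table (assuming every referenced my_ID exists in df_sp; unknown IDs would raise a KeyError):
# Build a my_ID -> ID mapping from df_sp
id_map = dict(zip(df_sp["my_ID"], df_sp["ID"]))
df_out["references"] = df_out["Related elements_my"].apply(
    lambda s: ", ".join(str(id_map[y.strip()]) for y in s.split(",") if y.strip()))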
Related
I'm trying to run a transformation function in a pyspark script:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "dev", table_name = "test_csv", transformation_ctx = "datasource0")
...
dataframe = datasource0.toDF()
...
from pyspark.sql.functions import array, col, explode, lit, struct

def to_long(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"
    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])
#to_long(df, ["A"])
....
df = to_long(dataframe, ["Name","Type"])
My dataset looks like this:
Name |01/01(FRI)|01/02(SAT)|
ALZA CZ| 0 | 0
CLPA CZ| 1 | 5
My desired output is something like this:
Name |Type | Date. |Value |
ALZA CZ|New | 01/01(FRI) | 0
CLPA CZ|New | 01/01(FRI) | 1
ALZA CZ|Old | 01/02(SAT) | 1
CLPA CZ|Old | 01/02(SAT) | 5
However, the last code line gives me an error similar to this:
AnalysisException: Cannot resolve 'Name' given input columns 'col10'
When I check:
df.show()
I see 'col1', 'col2', etc. in the first row instead of the actual labels (["Name","Type"]). Should I separately remove and then re-add the original column titles?
It seems that your metadata table was configured using the built-in CSV classifier. If this classifier isn't able to detect a header, it calls the columns col1, col2, etc.
Your problem lies one stage before your ETL job, so in my opinion you shouldn't remove and re-add the original column titles; instead, fix your data import / schema detection by using a custom classifier.
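That said, if you need a stopgap inside the job while you fix the classifier, you could rename the auto-generated columns yourself. A minimal sketch; the header list is an assumption based on your sample data and must match your real schema and column order:
# Hypothetical stopgap: rename Glue's auto-generated col1, col2, ... columns
headers = ["Name", "Type", "01/01(FRI)", "01/02(SAT)"]  # assumption: adjust to your data
dataframe = datasource0.toDF().toDF(*headers)  # DataFrame.toDF(*names) renames all columns
df = to_long(dataframe, ["Name", "Type"])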
I have a dataframe that comes from SharePoint (Microsoft), and it has a lot of JSON inside the cells as metadata. I usually don't work with JSON, so I'm struggling with it.
# df sample
+-------------+----------+
| Id | Event |
+-------------+----------+
| 105 | x |
+-------------+----------+
x = {"#odata.type":"#Microsoft.Azure.Connectors.SharePoint.SPListExpandedReference","Id":1,"Value":"Digital Training"}
How do I assign just the value "Digital Training" to the cell, for example? Keep in mind that this is occurring across a lot of columns, and I need to solve those too. Thanks.
If the Event column contains dict objects:
df['Value'] = df.apply(lambda x: x['Event']['Value'], axis=1)
If the Event column holds JSON strings:
import json
df['Value'] = df.apply(lambda x: json.loads(x['Event'])['Value'], axis=1)
Both result in
    Id                                              Event             Value
0  105  {"#odata.type":"#Microsoft.Azure.Connectors.Sh...  Digital Training
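Since this happens for a lot of columns, a hedged sketch that loops over them; the json_cols list is a placeholder you would fill with your own column names, and it handles both dict and string cells:
import json

json_cols = ["Event"]  # placeholder: list every metadata column you need to unpack
for c in json_cols:
    df[c + "_value"] = df[c].apply(
        lambda v: (json.loads(v) if isinstance(v, str) else v)["Value"])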
I have a dataframe like this:
Index | Labels                                             | Text
1.    | [(Task, add), (Application, MAMP), (Task, delete)] | "add or delete"
2.    | [(Servername, abhs)]                               | "servername create"
Is it possible to split it into this:
Index | Task        | Application | Servername | Text
1.    | add, delete | MAMP        |            | "add or delete"
2.    |             |             | abhs       | "servername create"
Basically, Labels is a list of tuples. The first entry of each tuple is the key, while the second entry is the value. I want the key as the column name and the value as the value in that row. If there is another value with the same key, the values should be joined together.
Other columns should stay empty if their key is not in the row.
df = pd.DataFrame()
df["Index"] = ["1.", "2."]
df["Labels"] = [[("Task", "add"), ("Application", "MAMP"), ("Task", "delete")], [("Servername", "abhs")]]
df["Text"] = ["add or delete", "servername create"]
At the moment I'm looking for ")" and splitting on it, but that treats the tuples as strings, and I think it shouldn't be that complicated; there must be a better way.
You can do it as follows:
df = pd.DataFrame()
df["Index"] = ["1.", "2."]
df["Labels"] = [[("Task", "add"), ("Application", "MAMP"), ("Task", "delete")], [("Servername", "abhs")]]
df["Text"] = ["add or delete", "servername create"]
def get_labels(labels):
    label_dict = {}
    for label, value in labels:
        label_dict[label] = label_dict.get(label, []) + [value]
    for key in label_dict:
        label_dict[key] = ", ".join(label_dict[key])  # in case you want it as a string and not as a list
    return label_dict

df = df.merge(df["Labels"].apply(lambda s: pd.Series(get_labels(s))),
              left_index=True, right_index=True).drop(["Labels"], axis=1).fillna('')
print(df)
Output:
Index Text Application Servername Task
0 1. add or delete MAMP add, delete
1 2. servername create abhs
As mentioned in the comments, you don't need the second for loop; if you delete it, the output looks like this instead (which I think is better for future use, but that is up to you):
Index Text Task Application Servername
0 1. add or delete [add, delete] [MAMP]
1 2. servername create [abhs]
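Alternatively, a sketch of the same reshaping with explode and pivot_table, starting from the original df defined above (assuming Labels always holds (key, value) tuples):
# Explode the tuple lists into rows, split them into key/value columns, then pivot
tmp = df.explode("Labels").reset_index(drop=True)
tmp[["key", "value"]] = pd.DataFrame(tmp["Labels"].tolist(), index=tmp.index)
wide = (tmp.pivot_table(index=["Index", "Text"], columns="key", values="value",
                        aggfunc=", ".join, fill_value="")
           .reset_index())
print(wide)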
I am trying to apply a function to a column in my df and add 4 new columns based on the returned list.
Here is the function that returns the list.
import re

def separateReagan(data):
    block = None
    township = None
    section = None
    acres = None
    if 'BLK' in data:
        pattern = r'BLK (\d{1,3})'
        blockList = re.findall(pattern, data)
        if blockList:
            block = blockList[0]
        else:
            pattern = r'B-([0-9]{1,3})'
            blockList = re.findall(pattern, data)
            if blockList:
                block = blockList[0]
    # Similar for others
    return [block, township, section, acres]
And here is the code with the dataframe.
df = df[['ID','Legal Description']]
# Dataframe looks like this
# ID Legal Description
# 0 1 143560 CLARKSON | ENDEAVOR ENERGY RESO | A- ,B...
# 1 2 143990 CLARKSON ESTATE | ENDEAVOR ENERGY RESO ...
# 2 3 144420 CLARKSON RANCH | ENDEAVOR ENERGY RESO |...
df[['Block','Township','Section','Acres']] = df.apply(lambda x: separateReagan(x['Legal Description']),axis=1)
I get this error:
KeyError: "['Block' 'Township' 'Section' 'Acres'] not in index"
Tried returning a tuple instead of a list; that didn't work either.
I threw together a small suggestion real quick that may be what you're looking for. Let me know if this helps.
from pandas import DataFrame
import re

def separate_reagan(row):
    # row is a single row from the dataframe, which is what is passed in
    # from df.apply(fcn, axis=1)
    # note: this means that you can also set values on the row
    # switch local variables to setting row values if you really want to
    # initialize them; if they are missing they should just become some
    # form of NaN or None, depending on the dtype
    row['township'] = None
    row['section'] = None
    row['acres'] = None
    row['block'] = None
    # grab the legal description here instead of passing it in as the only variable
    data = row['legal_description']
    if 'BLK' in data:
        block_list = re.search(r'BLK (\d{1,3})', data)
        if block_list:
            row['block'] = block_list.group(1)
        else:
            # since you only seem to want the first match,
            # search is probably more what you're looking for
            block_list = re.search(r'B-([0-9]{1,3})', data)
            if block_list:
                row['block'] = block_list.group(1)
    # Similar for others
    # return the modified row
    return row

df = DataFrame([
    {'id': 1, 'legal_description': '43560 CLARKSON | ENDEAVOR ENERGY RESO | A- ,B...'},
    {'id': 2, 'legal_description': '4143990 CLARKSON ESTATE | ENDEAVOR ENERGY RESO ...'},
    {'id': 3, 'legal_description': '144420 CLARKSON RANCH | ENDEAVOR ENERGY RESO |...'},
])
df = df[['id', 'legal_description']]
# df now only has the columns id and legal_description

# The original left-hand side tried to select columns that weren't in the
# dataframe yet; they also weren't returned from apply because separateReagan
# never set them on the row
df = df.apply(separate_reagan, axis=1)
# now these columns exist because the function sets them on each row
print(df[['block', 'township', 'section', 'acres']])
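As a side note, the original one-liner from the question can also be made to work while keeping separateReagan as a function that returns a list; result_type='expand' spreads the returned list across the four target columns:
# Alternative, assuming separateReagan returns [block, township, section, acres]
df[['Block', 'Township', 'Section', 'Acres']] = df.apply(
    lambda x: separateReagan(x['Legal Description']), axis=1, result_type='expand')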
I have run the following script, which uses fuzzy matching (difflib) to replace some common words from a list. Dataframe df1 contains my default list of possible values. Dataframe df2 is the main dataframe, where transformations/changes are undertaken after referring to df1. The code is as follows:
import difflib
import numpy as np
import pandas as pd

df1 = pd.DataFrame(['one', 'two', 'three', 'four', 'five', 'tsst'])
df2 = pd.DataFrame({'not_shifted': [np.nan, 'one', 'too', 'three', 'fours', 'five', 'six', np.nan, 'test']})
# Replace NaN values with empty strings so get_close_matches doesn't choke on them
df2 = pd.DataFrame(df2['not_shifted'].fillna(value=''))
df2['not_shifted'] = df2['not_shifted'].map(lambda x: difflib.get_close_matches(x, df1[0]))
The problem is that the output is a dataframe whose cells contain lists in square brackets. To make matters worse, the texts within df2['not_shifted'] are not directly usable as strings:
Out[421]:
not_shifted
0 []
1 [one]
2 [two]
3 [three]
4 [four]
5 [five]
6 []
7 []
8 [tsst]
Please help.
Use df2.not_shifted.apply(lambda x: x[0] if len(x) != 0 else "") or simply df2.not_shifted.str[0], as suggested by @Psidom.
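For example (note that .str[0] leaves NaN where the match list is empty, so you may want a fillna afterwards):
df2['not_shifted'] = df2.not_shifted.str[0]         # first match, NaN where no match was found
df2['not_shifted'] = df2['not_shifted'].fillna('')  # optional: blanks instead of NaN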
def replace_all(eg):
    rep = {"[": "",
           "]": "",
           "u": "",
           "}": "",
           "'": "",
           '"': "",
           "frozenset": ""}
    for i, j in rep.items():
        eg = eg.replace(i, j)
    return eg

for each in df.columns:
    df[each] = df[each].apply(lambda x: replace_all(str(x)))
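A hedged one-line equivalent using pandas' vectorized string replacement (note that, exactly like the dict above, this also strips every letter "u"):
for each in df.columns:
    df[each] = df[each].astype(str).str.replace(r"frozenset|[\[\]u}'\"]", "", regex=True)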