I have a dataframe that comes from SharePoint (Microsoft), and it has a lot of JSON inside the cells as metadata. I don't usually work with JSON, so I'm struggling with it.
# df sample
+-------------+----------+
| Id | Event |
+-------------+----------+
| 105 | x |
+-------------+----------+
x = {"#odata.type":"#Microsoft.Azure.Connectors.SharePoint.SPListExpandedReference","Id":1,"Value":"Digital Training"}
How do I assign just the value "Digital Training" to the cell, for example? Keep in mind that this is occurring for a lot of columns, and I need to solve it for those as well. Thanks.
If the Event column contains dict objects:
df['Value'] = df.apply(lambda x: x['Event']['Value'], axis=1)
If the Event column contains JSON strings:
import json
df['Value'] = df.apply(lambda x: json.loads(x['Event'])['Value'], axis=1)
Both result in
Id Event Value
0 x {"#odata.type":"#Microsoft.Azure.Connectors.Sh... Digital Training
I am learning Spark, and I am trying to create a column of the difference in days between a date and a cutoff value.
Here is some data along with my solution using pandas.
import numpy as np
import pandas as pd

lst = ['2018-11-21',
'2018-11-01',
'2018-10-09',
'2018-11-23',
'2018-11-08',
'2018-10-06',
'2018-11-27',
'2018-10-07',
'2018-10-23',
'2018-11-02']
d = pd.DataFrame({'event':np.arange(len(lst)),'ts':lst})
d['ts'] = d['ts'].apply(pd.to_datetime) # only needed because I have a list of strings
d['new_ts'] = d.ts - (d.ts.max() - pd.to_timedelta(15, unit='d'))
Unfortunately I can't find a way to adapt this logic to pyspark. I think the issue is in the subtraction of a static date that is not part of the DataFrame.
Assuming that df is the "Spark version" of the above dataset "d", here is one of the things I tried:
calculator = udf(lambda x: datediff(datediff(date_sub(max(x),30),x)))
c = df.withColumn('Recency',calculator(col('ts')))
However, the following calls give me a long error:
c.select(col('Recency')).show(1)
c.show(1)
Thanks in advance to everyone who is gonna help.
The logic is:
Compute max date.
Subtract given number of days to get cutoff date.
Find difference in days from cutoff date.
from pyspark.sql import functions as F

df = spark.createDataFrame(data=[["2018-11-21"],["2018-11-01"],["2018-10-09"],["2018-11-23"],["2018-11-08"],["2018-10-06"],["2018-11-27"],["2018-10-07"],["2018-10-23"],["2018-11-02"]], schema=["ts"])
df = df.withColumn("ts", F.to_date("ts", "yyyy-MM-dd"))
# Cutoff date = max(ts) minus 15 days, computed on the driver.
cutoff_dt = df.select(F.date_sub(F.max("ts"), 15).alias("cutoff_dt")).first().asDict()["cutoff_dt"]
df = df.withColumn("new_ts", F.datediff("ts", F.lit(cutoff_dt)))
df.show(truncate=False)
+----------+------+
|ts |new_ts|
+----------+------+
|2018-11-21|9 |
|2018-11-01|-11 |
|2018-10-09|-34 |
|2018-11-23|11 |
|2018-11-08|-4 |
|2018-10-06|-37 |
|2018-11-27|15 |
|2018-10-07|-36 |
|2018-10-23|-20 |
|2018-11-02|-10 |
+----------+------+
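As a note on the design choice: the cutoff is computed on the driver with first() and then pushed back in as a literal. If you would rather keep everything in one Spark expression, a window over the whole frame is a possible alternative (a sketch, assuming the same df as above; an empty partitioning pulls all rows into a single partition, so it trades parallelism for convenience):
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy()  # no partition columns: the max is taken over the whole frame
df = df.withColumn("new_ts", F.datediff("ts", F.date_sub(F.max("ts").over(w), 15)))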
I want to fill the 'references' column in df_out with the 'ID' if the corresponding 'my_ID' in df_sp is contained in df_jira 'reference_ids'.
import pandas as pd
d_sp = {'ID': [1,2,3,4], 'my_ID': ["my_123", "my_234", "my_345", "my_456"], 'references':["","","2",""]}
df_sp = pd.DataFrame(data=d_sp)
d_jira = {'my_ID': ["my_124", "my_235", "my_346"], 'reference_ids': ["my_123, my_234", "", "my_345"]}
df_jira = pd.DataFrame(data=d_jira)
df_new = df_jira[~df_jira["my_ID"].isin(df_sp["my_ID"])].copy()
df_out = pd.DataFrame(columns=df_sp.columns)
needed_cols = list(set(df_sp.columns).intersection(df_new.columns))
for column in needed_cols:
    df_out[column] = df_new[column]
df_out['Related elements_my'] = df_jira['reference_ids']
Desired output df_out:
| ID | my_ID | references |
|----|-------|------------|
| | my_124| 1, 2 |
| | my_235| |
| | my_346| 3 |
What I tried so far is list comprehension, but I only managed to get the reference_ids "copied" from a helper column to my 'references' column with this:
for row, entry in df_out.iterrows():
    cpl_ids = [x for x in entry['Related elements_my'].split(', ') if any(vh_id == x for vh_id in df_cpl_list['my-ID'])]
    df_out.at[row, 'Related elements'] = ', '.join(cpl_ids)
I cannot wrap my head around how to get the specific 'ID's for the matches of any(), or whether this is actually the way to go, since I need all the matches, not just whether there is any match.
Any hints are appreciated!
I work with Python 3.9.4 on Windows (adding this in case Python 3.10 has another solution).
Backstory: Moving data from Jira to MS SharePoint lists. (Therefore, the 'ID' does not equal the actual index in the dataframe, but is rather assigned by SharePoint upon insertion into the list. Hence, empty after running for the new entries.)
ref_df = df_sp[["ID", "my_ID"]].set_index("my_ID")
df_out["references"] = df_out["Related elements_my"].apply(
    lambda x: ",".join(map(lambda y: "" if y == "" else str(ref_df.loc[y.strip()].ID), x.split(",")))
)
df_out[["ID", "my_ID", "references"]]
output:
ID my_ID references
0 NaN my_124 1,2
1 NaN my_235
2 NaN my_346 3
What is map?
map is something like [func(i) for i in lst]: it applies func to every element of lst, but in a way that can be faster.
You can read more about it here: https://realpython.com/python-map-function/
Here, our function is: lambda y: "" if y == "" else str(ref_df.loc[y.strip()].ID)
So if y (y.strip() is there just to remove spaces) is empty, it maps to an empty string: "" if y == "" (as with the row whose reference_ids is empty).
Otherwise it locates y in ref_df and gets the corresponding ID, i.e. it maps each my_ID to an ID.
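A tiny illustration of map itself (not the answer's data):
nums = ["1", "2", "3"]
list(map(int, nums))      # [1, 2, 3]
[int(i) for i in nums]    # same result with a list comprehension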
Hope this is helpful :)
I'm trying to run a transformation function in a pyspark script:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "dev", table_name = "test_csv", transformation_ctx = "datasource0")
...
dataframe = datasource0.toDF()
...
from pyspark.sql.functions import array, col, explode, lit, struct

def to_long(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"
    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])
#to_long(df, ["A"])
....
df = to_long(dataframe, ["Name","Type"])
My dataset looks like this:
Name |01/01(FRI)|01/02(SAT)|
ALZA CZ| 0 | 0
CLPA CZ| 1 | 5
My desired output is something like this:
Name |Type | Date. |Value |
ALZA CZ|New | 01/01(FRI) | 0
CLPA CZ|New | 01/01(FRI) | 1
ALZA CZ|Old | 01/02(SAT) | 1
CLPA CZ|Old | 01/02(SAT) | 5
However, the last code line gives me an error similar to this:
AnalysisException: Cannot resolve 'Name' given input columns 'col10'
When I check:
df.show()
I see 'col1', 'col2', etc. in the first row instead of the actual labels (["Name","Type"]). Should I separately remove and then re-add the original column titles?
It seems like your metadata table was configured using the built-in CSV classifier. If this classifier isn't able to detect a header, it names the columns col1, col2, etc.
Your problem lies one stage before your ETL job, so in my opinion you shouldn't remove and re-add the original column titles, but rather fix your data import / schema detection by using a custom classifier.
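As a rough sketch of that direction (all names here are placeholders, the exact settings depend on your CSV, and it assumes the table is populated by a Glue crawler you can update via boto3): register a custom CSV classifier that treats the first row as a header and attach it to the crawler before re-running it.
import boto3

glue = boto3.client("glue")

# Hypothetical classifier and crawler names; adjust to your setup.
glue.create_classifier(
    CsvClassifier={
        "Name": "csv_with_header",
        "Delimiter": ",",
        "ContainsHeader": "PRESENT",  # force the first row to be treated as the header
    }
)
glue.update_crawler(Name="dev_crawler", Classifiers=["csv_with_header"])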
I have a df such as below (3 rows as an example):
ID | Dollar_Value
C 45.32
E 5.21
V 121.32
When I view the df in my notebook (e.g. just evaluating df), it shows Dollar_Value as
ID | Dollar_Value
C 8.493000e+01
E 2.720000e+01
V 1.720000e+01
instead of the regular format. But when I try to filter the df for a specific ID, it shows the values as they are supposed to be (e.g. 82.23 or 2.45):
df[df['ID'] == 'E']
ID | Dollar_Value
E 45.32
Is there something I have to do formatting-wise so the df itself displays the value column as it's supposed to?
Thanks!
You can try running this code before printing, since your columns may contain very big or very small numbers (check with df.describe()):
pd.set_option('display.float_format', lambda x: '%.3f' % x)
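If you'd rather not change the global display option, a small alternative sketch (column name taken from your example) is to format just that column for viewing, or reset the option afterwards:
df['Dollar_Value'].map('{:,.2f}'.format)   # string-formatted copy, just for display
pd.reset_option('display.float_format')    # undo the global setting later if needed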
I'm trying to concatenate two dataframes and write the resulting dataframe to an Excel file. The concatenation is performed somewhat successfully, but I'm having a difficult time eliminating the index row that also gets appended.
I would appreciate it if someone could highlight what I'm doing wrong. I thought providing the index=False argument at every Excel call would eliminate the issue, but it has not.
(screenshot of the resulting Excel output)
Hopefully you can see the image; if not, please let me know.
# filenames
file_name = "C:\\Users\\ga395e\\Desktop\\TEST_FILE.xlsx"
file_name2 = "C:\\Users\\ga395e\\Desktop\\TEST_FILE_2.xlsx"
#create data frames
df = pd.read_excel(file_name, index = False)
df2 = pd.read_excel(file_name2,index =False)
#filter frame
df3 = df2[['WDDT', 'Part Name', 'Remove SN']]
#concatenate values
df4 = df3['WDDT'].map(str) + '-' +df3['Part Name'].map(str) + '-' + 'SN:'+ df3['Remove SN'].map(str)
test=pd.DataFrame(df4)
test=test.transpose()
df = pd.concat([df, test], axis=1)
df.to_excel("C:\\Users\\ga395e\\Desktop\\c.xlsx", index=False)
Thanks
As the other users also wrote, I don't see the index in your image either, because in that case you would have an output like the following:
| Index | Column1 | Column2 |
|-------+----------+----------|
| 0 | Entry1_1 | Entry1_2 |
| 1 | Entry2_1 | Entry2_2 |
| 2 | Entry3_1 | Entry3_2 |
if you pass the index=False option the index will be removed:
| Column1 | Column2 |
|----------+----------|
| Entry1_1 | Entry1_2 |
| Entry2_1 | Entry2_2 |
| Entry3_1 | Entry3_2 |
which looks like your case. Your problem could be related to the concatenation and the transposed matrix.
Did you check your temporary dataframe before exporting it?
You might want to check whether pandas imports the time column as a time index.
If you want to delete those time columns, you could use df.drop and pass a list of columns to it, e.g. df.drop(columns=df.columns[:3]). Does this maybe solve your problem?
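A minimal sketch of that check (file paths taken from the question; the "Unnamed" prefix is only an assumption about how a previously saved index usually re-imports):
import pandas as pd

df = pd.read_excel("C:\\Users\\ga395e\\Desktop\\TEST_FILE.xlsx")  # read_excel has no index=False argument
# Drop any column that looks like a re-imported index.
df = df.drop(columns=[c for c in df.columns if str(c).startswith("Unnamed")])
df.to_excel("C:\\Users\\ga395e\\Desktop\\c.xlsx", index=False)    # index=False only matters when writing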