I have the following data:
data = {
'employee' : ['Emp1', 'Emp2', 'Emp3', 'Emp4', 'Emp5'],
'code' : ['2018_1', '2018_3', '2019_1', '2019_2', '2017_1'],
}
old_salary_bonus = 3000
new_salary_bonus = {
'2019_1': 1000,
'2019_2': 980,
}
df = pd.DataFrame(data)
Task: Add df['salary_bonus'] column based on the following condition:
If an employee's code contains '2019', use the code to look up the salary bonus in new_salary_bonus; otherwise use old_salary_bonus.
Expected Output:
employee code salary_bonus
0 Emp1 2018_1 3000
1 Emp2 2018_3 3000
2 Emp3 2019_1 1000
3 Emp4 2019_2 980
4 Emp5 2017_1 3000
Please help.
Use Series.map with Series.fillna to replace non-matched values:
import pandas as pd
data = {
'employee' : ['Emp1', 'Emp2', 'Emp3', 'Emp4', 'Emp5'],
'code' : ['2018_1', '2018_3', '2019_1', '2019_2', '2017_1'],
}
old_salary_bonus = 3000
new_salary_bonus = {
'2019_1': 1000,
'2019_2': 980,
}
df = pd.DataFrame(data)
df['salary_bonus'] = df['code'].map(new_salary_bonus).fillna(old_salary_bonus)
print (df)
employee code salary_bonus
0 Emp1 2018_1 3000.0
1 Emp2 2018_3 3000.0
2 Emp3 2019_1 1000.0
3 Emp4 2019_2 980.0
4 Emp5 2017_1 3000.0
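Note that the column comes out as float because the unmatched codes pass through NaN before fillna. If you want the integer output shown in the question, a small follow-up cast works (assuming every code ends up with a bonus after fillna):
df['salary_bonus'] = (
    df['code'].map(new_salary_bonus).fillna(old_salary_bonus).astype(int)
)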
Another solution uses dict.get with a default value for non-matched keys:
df['salary_bonus'] = df['code'].map(lambda x: new_salary_bonus.get(x, old_salary_bonus))
You can use the code below:
df['salary_bonus'] = old_salary_bonus
df.loc[df['code'].isin(list(new_salary_bonus)), 'salary_bonus'] = list(new_salary_bonus.values())
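Note that the second line relies on the matching rows appearing in the same order as the keys of new_salary_bonus. A sketch of a variant that keeps the first line but aligns by code value rather than by position:
mask = df['code'].isin(new_salary_bonus)
df.loc[mask, 'salary_bonus'] = df.loc[mask, 'code'].map(new_salary_bonus)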
I'm writing a function to filter tweet data for rows whose text contains a search word.
Here's my code:
def twitter_filter(df, search):
    coun = 0
    date_ls = []
    id_ls = []
    content_ls = []
    lan_ls = []
    name_ls = []
    retweet_ls = []
    cleaned_tweet_ls = []

    for i, row in df.iterrows():
        if search in row.cleaned_tweet:
            date_ls.append(row.date)
            id_ls.append(row.id)
            content_ls.append(row.content)
            lan_ls.append(row.language)
            name_ls.append(row.name)
            retweet_ls.append(row.retweet)
            cleaned_tweet_ls.append(row.cleaned_tweet)

    new_dict = {
        "date": date_ls,
        "id": id_ls,
        "content": content_ls,
        "lan": lan_ls,
        "name": name_ls,
        "retweet": retweet_ls,
        "cleaned_tweeet": cleaned_tweet_ls,
    }
    new_df = pd.DataFrame(new_dict)
    return new_df
Before filter:
cleandf['name']
Out[6]:
0 PryZmRuleZZ
1 Arbitration111
2 4kjweed
3 THEREALCAMOJOE
5 DailyBSC_
130997 Rabbitdogebsc
130999 gmtowner
131000 topcryptostats
131001 vGhostvRiderv
131002 gmtowner
Name: name, Length: 98177, dtype: object
After filtering, the user names become seemingly random integers:
cleanedogetweet['name']
Out[7]:
0 3
1 5
2 9
3 12
4 34
80779 130997
80780 130999
80781 131000
80782 131001
80783 131002
Name: name, Length: 80784, dtype: int64
This problem only happens in the user's name column; other columns that contain strings are fine.
I expected the original user names to be kept. How can I solve this?
When you iterate with iterrows, each row is a pandas Series, and every Series has an attribute called name.
By default, a row's name is its index label, so row.name returns the row index rather than the value in your 'name' column.
It is therefore better not to call a column name, because attribute access will always hit the Series attribute instead of the column.
You can use the rename method to give the column another name such as user_name, or you can switch to bracket indexing and change your function to this:
def twitter_filter(df, search):
    coun = 0
    date_ls = []
    id_ls = []
    content_ls = []
    lan_ls = []
    name_ls = []
    retweet_ls = []
    cleaned_tweet_ls = []

    for i, row in df.iterrows():
        if search in row.cleaned_tweet:
            date_ls.append(row['date'])
            id_ls.append(row['id'])
            content_ls.append(row['content'])
            lan_ls.append(row['language'])
            name_ls.append(row['name'])
            retweet_ls.append(row['retweet'])
            cleaned_tweet_ls.append(row['cleaned_tweet'])

    new_dict = {
        "date": date_ls,
        "id": id_ls,
        "content": content_ls,
        "lan": lan_ls,
        "user_name": name_ls,
        "retweet": retweet_ls,
        "cleaned_tweeet": cleaned_tweet_ls,
    }
    new_df = pd.DataFrame(new_dict)
    return new_df
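As a side note, you can sidestep the clash (and the iterrows loop) by renaming the column first and filtering with a vectorized string test. A rough sketch, assuming cleaned_tweet holds plain strings:
df = df.rename(columns={'name': 'user_name'})
filtered = df[df['cleaned_tweet'].str.contains(search, regex=False)].reset_index(drop=True)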
I'm trying to flatten this JSON response into a pandas dataframe to export to CSV.
It looks like this:
j = [
{
"id": 401281949,
"teams": [
{
"school": "Louisiana Tech",
"conference": "Conference USA",
"homeAway": "away",
"points": 34,
"stats": [
{"category": "rushingTDs", "stat": "1"},
{"category": "puntReturnYards", "stat": "24"},
{"category": "puntReturnTDs", "stat": "0"},
{"category": "puntReturns", "stat": "3"},
],
}
],
}
]
...Many more items in the stats area.
If I run this and flatten to the teams level:
multiple_level_data = pd.json_normalize(j, record_path =['teams'])
I get:
school conference homeAway points stats
0 Louisiana Tech Conference USA away 34 [{'category': 'rushingTDs', 'stat': '1'}, {'ca...
How do I flatten it twice so that each stat gets its own column in each row?
If I do this:
multiple_level_data = pd.json_normalize(j, record_path =['teams'])
multiple_level_data = multiple_level_data.explode('stats').reset_index(drop=True)
multiple_level_data=multiple_level_data.join(pd.json_normalize(multiple_level_data.pop('stats')))
I end up with multiple rows instead of more columns:
You can try:
df = pd.DataFrame(j).explode("teams")
df = pd.concat([df, df.pop("teams").apply(pd.Series)], axis=1)
df["stats"] = df["stats"].apply(lambda x: {d["category"]: d["stat"] for d in x})
df = pd.concat(
[
df,
df.pop("stats").apply(pd.Series),
],
axis=1,
)
print(df)
Prints:
id school conference homeAway points rushingTDs puntReturnYards puntReturnTDs puntReturns
0 401281949 Louisiana Tech Conference USA away 34 1 24 0 3
Can you try this:
multiple_level_data = pd.json_normalize(j, record_path =['teams'])
multiple_level_data = multiple_level_data.explode('stats').reset_index(drop=True)
multiple_level_data=multiple_level_data.join(pd.json_normalize(multiple_level_data.pop('stats')))
#convert rows to columns.
multiple_level_data=multiple_level_data.set_index(multiple_level_data.columns[0:4].to_list())
dfx=multiple_level_data.pivot_table(values='stat',columns='category',aggfunc=list).apply(pd.Series.explode).reset_index(drop=True)
multiple_level_data=multiple_level_data.reset_index().drop(['stat','category'],axis=1).drop_duplicates().reset_index(drop=True)
multiple_level_data=multiple_level_data.join(dfx)
Output:
           school      conference homeAway  points puntReturnTDs puntReturnYards puntReturns rushingTDs
0  Louisiana Tech  Conference USA     away      34             0              24           3          1
Instead of calling explode() on the output of json_normalize(), you can explicitly pass the metadata path for each column in a single json_normalize() call. For example, ['teams', 'school'] would be one path, ['teams', 'conference'] another, etc. This will create a long dataframe similar to what you already have.
Then you can call pivot() to reshape this output into the correct shape.
# normalize json
df = pd.json_normalize(
j, record_path=['teams', 'stats'],
meta=['id', *(['teams', c] for c in ('school', 'conference', 'homeAway', 'points'))]
)
# column name contains 'teams' prefix; remove it
df.columns = [c.split('.')[1] if '.' in c else c for c in df]
# pivot the intermediate result
df = (
    df.astype({'points': int, 'id': int})
    .pivot(index=['id', 'school', 'conference', 'homeAway', 'points'],
           columns='category', values='stat')
    .reset_index()
)
# remove index name
df.columns.name = None
df
I have some variables and a dictionary mapping strings to imported Google Sheets:
grad_year = '2029'
df_dict = {'grade_1': grade_1_class_2029,
'grade_2': grade_2_class_2029,
'grade_3': grade_3_class_2029,
'grade_4': grade_4_class_2029,
'grade_5': grade_5_class_2029}
I then turn the Google Sheets into dataframes, naming them dynamically:
for key, val in df_dict.items():
    rows = val.get_all_values()
    vars()["df_" + key + "_class_" + grad_year] = pd.DataFrame.from_records(
        rows[2:], columns=rows[1]
    )
Now I would like to reference them without a pre-created dictionary of their names.
There is still a bunch of stuff I would like to do to the new dataframes such as deleting blank rows. I have tried:
for key, val in df_dict.items():
    rows = val.get_all_values()
    vars()["df_" + key + "_class_" + grad_year] = pd.DataFrame.from_records(
        rows[2:], columns=rows[1]
    )
    vars()["df_" + key + "_class_" + grad_year].replace("", nan_value, inplace=True)
    vars()["df_" + key + "_class_" + grad_year].dropna(
        subset=["Last Name"], inplace=True
    )
and
for key, val in df_dict.items():
    rows = val.get_all_values()
    vars()["df_" + key + "_class_" + grad_year] = (
        pd.DataFrame.from_records(rows[2:], columns=rows[1])
        .replace("", nan_value, inplace=True)
        .dropna(subset=["Last Name"], inplace=True)
    )
but neither worked.
If you replace nan_value by pd.NA (Pandas 1.0.0 and beyond), your first code snippet works fine:
import pandas as pd
grad_year = "2029"
vars()[f"df_{grad_year}"] = pd.DataFrame(
{
"class": {
0: "class1",
1: "class2",
2: "class3",
3: "class4",
},
"name": {0: "John", 1: "Jack", 2: "", 3: "Butch"},
}
)
vars()[f"df_{grad_year}"].replace("", pd.NA, inplace=True)
vars()[f"df_{grad_year}"].dropna(subset=["name"], inplace=True)
print(vars()[f"df_{grad_year}"])
# Outputs
class name
0 class1 John
1 class2 Jack
3 class4 Butch
In your second code snippet, you also have to set inplace to False instead of True in both calls so that the method chain works (with inplace=True each call returns None):
vars()[f"df_{grad_year}"] = (
pd.DataFrame(
{
"class": {
0: "class1",
1: "class2",
2: "class3",
3: "class4",
},
"name": {0: "John", 1: "Jack", 2: "", 3: "Butch"},
}
)
.replace("", pd.NA, inplace=False)
.dropna(subset=["name"], inplace=False)
)
print(vars()[f"df_{grad_year}"])
# Output
class name
0 class1 John
1 class2 Jack
3 class4 Butch
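As an aside, you can skip vars() altogether by keeping the cleaned frames in a plain dictionary keyed by the generated name. A sketch along the lines of your loop, using pd.NA as the missing-value marker:
dfs = {}
for key, val in df_dict.items():
    rows = val.get_all_values()
    dfs[f"df_{key}_class_{grad_year}"] = (
        pd.DataFrame.from_records(rows[2:], columns=rows[1])
        .replace("", pd.NA)
        .dropna(subset=["Last Name"])
    )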
I have an example json data file which has the following structure:
{
    "Header": {
        "Code1": "abc",
        "Code2": "def",
        "Code3": "ghi",
        "Code4": "jkl"
    },
    "TimeSeries": {
        "2020-11-25T03:00:00+00:00": {
            "UnitPrice": 1000,
            "Amount": 10000
        },
        "2020-11-26T03:00:00+00:00": {
            "UnitPrice": 1000,
            "Amount": 10000
        }
    }
}
When I parse this into databricks with command:
df = spark.read.json("/FileStore/test.txt")
I get two objects as output: Header and TimeSeries. I want to flatten the TimeSeries structure so it has the following schema:
Date
UnitPrice
Amount
As the date field is a key, I am currently only able to access it by iterating through the column names and then using them dynamically in dot notation:
def flatten_json(data):
    columnlist = data.select("TimeSeries.*")
    count = 0
    for name in data.select("TimeSeries.*"):
        df1 = data.select("Header.*").withColumn("Timeseries", lit(columnlist.columns[count])).withColumn("join", lit("a"))
        df2 = data.select("TimeSeries." + columnlist.columns[count] + ".*").withColumn("join", lit("a"))
        if count == 0:
            df3 = df1.join(df2, on=['join'], how="inner")
        else:
            df3 = df3.union(df1.join(df2, on=['join'], how="inner"))
        count = count + 1
    return df3
This is far from ideal. Does anyone know a better method to create the described dataframe?
The idea:
Step 1: Extract Header and TimeSeries separately.
Step 2: For each field in the TimeSeries object, extract the Amount and UnitPrice, together with the name of the field, stuff them into a struct.
Step 3: Merge all these structs into an array column, and explode it.
Step 4: Extract Timeseries, Amount and UnitPrice from the exploded column.
Step 5: Cross join with the Header row.
import pyspark.sql.functions as F

header_df = df.select("Header.*")
timeseries_df = df.select("TimeSeries.*")
fieldNames = enumerate(timeseries_df.schema.fieldNames())
cols = [
    F.struct(
        F.lit(name).alias("Timeseries"),
        F.col(name).getItem("Amount").alias("Amount"),
        F.col(name).getItem("UnitPrice").alias("UnitPrice"),
    ).alias("ts_" + str(idx))
    for idx, name in fieldNames
]
combined = F.explode(F.array(cols)).alias("comb")
timeseries = timeseries_df.select(combined).select("comb.Timeseries", "comb.Amount", "comb.UnitPrice")
result = header_df.crossJoin(timeseries)
result.show(truncate=False)
Output:
+-----+-----+-----+-----+-------------------------+------+---------+
|Code1|Code2|Code3|Code4|Timeseries |Amount|UnitPrice|
+-----+-----+-----+-----+-------------------------+------+---------+
|abc |def |ghi |jkl |2020-11-25T03:00:00+00:00|10000 |1000 |
|abc |def |ghi |jkl |2020-11-26T03:00:00+00:00|10000 |1000 |
+-----+-----+-----+-----+-------------------------+------+---------+
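Another option is to declare the schema explicitly so that TimeSeries is read as a map column, which you can explode directly instead of looping over field names. A sketch, assuming the file parses the same way as in the question (add multiLine=True to the reader if the JSON object spans several lines):
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType, MapType

# TimeSeries becomes a map from date string to a struct of UnitPrice/Amount
schema = StructType([
    StructField("Header", StructType([
        StructField("Code1", StringType()),
        StructField("Code2", StringType()),
        StructField("Code3", StringType()),
        StructField("Code4", StringType()),
    ])),
    StructField("TimeSeries", MapType(StringType(), StructType([
        StructField("UnitPrice", LongType()),
        StructField("Amount", LongType()),
    ]))),
])

df = spark.read.schema(schema).json("/FileStore/test.txt")
result = (
    df.select("Header.*", F.explode("TimeSeries").alias("Date", "ts"))
      .select("Code1", "Code2", "Code3", "Code4", "Date", "ts.UnitPrice", "ts.Amount")
)
result.show(truncate=False)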
I want to merge these two data frames (with no common columns) side by side. The two dataframes look like this:
df1:
10.74,5.71,5.41
11.44,6.1,5.87
df2:
10.17,6.58,5.23
9.99,5.75,5.13
11.21,6.35,5.72
10.3,5.86,5.12
I am trying with:
df_total=pd.concat([df1,df2],axis=1)
But the result looks something like this:
Access grade global,Grade_global,Regression_global,Access grade,Grade,Regression
,,,10.74,5.71,5.41
,,,11.44,6.1,5.87
10.17,6.58,5.23,,,
9.99,5.75,5.13,,,
11.21,6.35,5.72,,,
10.3,5.86,5.12,,,
And I want to have something like this:
10.17,6.58,5.23,10.74,5.71,5.41
9.99,5.75,5.13,11.44,6.1,5.87
11.21,6.35,5.72
10.3,5.86,5.12
The two things I want to know how to do are:
1- How can I merge the 2 data frames so that the values are next to each other (the number of rows should, therefore, be the maximum number of rows between the two dataframes; 4 in this case).
2- How can I avoid having NaN values (you can see that at the end there are multiple commas)? I want to avoid this because in the scatter plot I use, all the NaN values are plotted as 0, so I get a line of dots at y=0.
The NaN values are being plotted as zeros. The HTML snippet is:
<div style="line-height:77%;"><br></div>
<div id="grade_access_hs"></div>
<div style="line-height:77%;"><br></div>
<p>The lines that best approximate the expected grades according to the access grade to University and comparing all students with {{user.hsname}}' students are:</p>
<div style="line-height:30%;"><br></div>
<div id="equation3"></div>
<div style="line-height:30%;"><br></div>
<div id="equation4"></div>
<script type="text/javascript" src="../static/scripts/grade_access_hs.js"></script>
The full chart code:
<script>
'use strict';
var Grade_access_hs = c3.generate({
bindto: '#grade_access_hs',
data: {
url: '../static/CSV/Chart_data/grades_access_hs.csv',
xs: {
Grade_global: 'Access grade global',
Grade: 'Access grade',
Regression_global: 'Access grade global',
Regression: 'Access grade'
},
types: {
Grade_global:'scatter',
Grade:'scatter',
Regression_global: 'line',
Regression: 'line'
},
},
axis: {
y: {
label: {
text: "Average grade",
position: "outer-middle"
},
min: 1,
max: 9,
tick: {outer: false}
},
x: {
label: {
text: "Access grade PAU",
position: "outer-center"
},
min: 9,
max: 14,
tick: {
outer: false,
count:1,
fit:false,
values: [9,10,11,12,13,14]
}
}
},
size: {
height: 400,
width: 800
},
zoom: {
enabled: true
},
legend: {
show: true,
position: 'inset',
inset: {
anchor: 'top-right',
x: 20,
y: 20
}
},
})
d3.csv('../static/CSV/Chart_data/grades_access_hs.csv',function(data){
var d1 = data[0];
var d2 = data[1];
var b = (1-(d2['Regression_global']/d1['Regression_global']))/((d1['Access grade global']-d2['Access grade global'])/d1['Regression_global'])
var a = d1['Regression_global'] - (b * d1['Access grade global'])
b = (Math.round(b*1000)/1000);
a = (Math.round(a*1000)/1000);
document.getElementById("equation3").innerHTML = "Global: Grade = " + a + "·x + " + b;
var d = (1-(d2['Regression']/d1['Regression']))/((d1['Access grade private']-d2['Access grade private'])/d1['Regression'])
var c = d1['Regression'] - (b * d1['Access grade'])
d = (Math.round(d*1000)/1000);
c = (Math.round(c*1000)/1000);
document.getElementById("equation4").innerHTML = "Specific high school: Grade = " + c + "·x + " + d;
})
</script>
With grades_access_hs.csv:
Access grade global,Grade_global,Regression_global,Access grade,Grade,Regression
,,,10.74,5.71,5.41
,,,11.44,6.1,5.87
,,,11.21,6.35,5.72
,,,10.3,5.86,5.12
10.17,6.58,5.23,,,
9.99,5.75,5.13,,,
10.96,5.84,5.71,,,
9.93,6.12,5.09,,,
9.93,6.0,5.09,,,
11.21,6.22,5.86,,,
11.28,6.1,5.9,,,
,,,10.93,6.08,5.54
Thanks in advance!
I think you need a join and fillna
print(df2.join(df1).fillna(''))
10.17 6.58 5.23 10.74 5.71 5.41
9.99 5.75 5.13 11.44 6.1 5.87
11.21 6.35 5.72
10.30 5.86 5.12
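The misaligned rows in your original concat come from the two frames carrying disjoint indexes, so both concat and join pair rows by index label rather than by position. Resetting the indexes first lines the rows up (a sketch, with df2 first to match your expected output):
df_total = pd.concat(
    [df2.reset_index(drop=True), df1.reset_index(drop=True)], axis=1
)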
Without giving it too much thought:
if df1.shape[0] > df2.shape[0]:
    new_rows = df1.shape[0] - df2.shape[0]
    # pad df2 with zero rows so both frames have the same length
    df3 = pd.DataFrame(np.zeros((new_rows, df2.shape[1])), columns=df2.columns)
    df2 = pd.concat([df2, df3], ignore_index=True)
# Alternative elif goes here, doing the converse for df1.
new_df = pd.concat((df1, df2), axis=1)
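If you would rather pad the shorter frame with empty cells than with zeros, so the exported CSV does not feed NaN (or 0) into the chart, you could reindex both frames to the longer length and fill. A sketch, assuming both frames use a default RangeIndex; the output path is only illustrative:
n = max(len(df1), len(df2))
df_total = pd.concat(
    [df2.reindex(range(n)), df1.reindex(range(n))], axis=1
).fillna('')
df_total.to_csv('grades_access_hs.csv', index=False)  # adjust the path to where the chart reads it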