Uniting two data frames - python

I want to merge these two data frames (which have no common columns) so that they sit side by side. The two dataframes look like this:
df1:
10.74,5.71,5.41
11.44,6.1,5.87
df2:
10.17,6.58,5.23
9.99,5.75,5.13
11.21,6.35,5.72
10.3,5.86,5.12
I am trying with:
df_total=pd.concat([df1,df2],axis=1)
But the result looks something like this:
Access grade global,Grade_global,Regression_global,Access grade,Grade,Regression
,,,10.74,5.71,5.41
,,,11.44,6.1,5.87
10.17,6.58,5.23,,,
9.99,5.75,5.13,,,
11.21,6.35,5.72,,,
10.3,5.86,5.12,,,
And I want to have something like this:
10.17,6.58,5.23,10.74,5.71,5.41
9.99,5.75,5.13,11.44,6.1,5.87
11.21,6.35,5.72
10.3,5.86,5.12
The two things I want to know how to do are:
1- How can I merge the two data frames so that the values sit next to each other? (The number of rows should therefore be the maximum number of rows of the two dataframes; 4 in this case.)
2- How can I avoid the NaN values (you can see them as the runs of commas above)? I want to avoid them because in the scatter plot I use, every NaN is plotted as 0, so I get a line of dots at y=0.
The NaN values are being plotted as zeros in the resulting chart.
The HTML snippet is:
<div style="line-height:77%;"><br></div>
<div id="grade_access_hs"></div>
<div style="line-height:77%;"><br></div>
<p>The lines that best approximate the expected grades according to the access grade to University and comparing all students with {{user.hsname}}' students are:</p>
<div style="line-height:30%;"><br></div>
<div id="equation3"></div>
<div style="line-height:30%;"><br></div>
<div id="equation4"></div>
<script type="text/javascript" src="../static/scripts/grade_access_hs.js"></script>
The full chart script is:
<script>
    'use strict';
    var Grade_access_hs = c3.generate({
        bindto: '#grade_access_hs',
        data: {
            url: '../static/CSV/Chart_data/grades_access_hs.csv',
            xs: {
                Grade_global: 'Access grade global',
                Grade: 'Access grade',
                Regression_global: 'Access grade global',
                Regression: 'Access grade'
            },
            types: {
                Grade_global: 'scatter',
                Grade: 'scatter',
                Regression_global: 'line',
                Regression: 'line'
            },
        },
        axis: {
            y: {
                label: {
                    text: "Average grade",
                    position: "outer-middle"
                },
                min: 1,
                max: 9,
                tick: {outer: false}
            },
            x: {
                label: {
                    text: "Access grade PAU",
                    position: "outer-center"
                },
                min: 9,
                max: 14,
                tick: {
                    outer: false,
                    count: 1,
                    fit: false,
                    values: [9, 10, 11, 12, 13, 14]
                }
            }
        },
        size: {
            height: 400,
            width: 800
        },
        zoom: {
            enabled: true
        },
        legend: {
            show: true,
            position: 'inset',
            inset: {
                anchor: 'top-right',
                x: 20,
                y: 20
            }
        },
    })
    d3.csv('../static/CSV/Chart_data/grades_access_hs.csv', function(data){
        var d1 = data[0];
        var d2 = data[1];
        var b = (1 - (d2['Regression_global'] / d1['Regression_global'])) / ((d1['Access grade global'] - d2['Access grade global']) / d1['Regression_global'])
        var a = d1['Regression_global'] - (b * d1['Access grade global'])
        b = (Math.round(b * 1000) / 1000);
        a = (Math.round(a * 1000) / 1000);
        document.getElementById("equation3").innerHTML = "Global: Grade = " + a + "·x + " + b;
        var d = (1 - (d2['Regression'] / d1['Regression'])) / ((d1['Access grade private'] - d2['Access grade private']) / d1['Regression'])
        var c = d1['Regression'] - (b * d1['Access grade'])
        d = (Math.round(d * 1000) / 1000);
        c = (Math.round(c * 1000) / 1000);
        document.getElementById("equation4").innerHTML = "Specific high school: Grade = " + c + "·x + " + d;
    })
</script>
With grades_access_hs.csv:
Access grade global,Grade_global,Regression_global,Access grade,Grade,Regression
,,,10.74,5.71,5.41
,,,11.44,6.1,5.87
,,,11.21,6.35,5.72
,,,10.3,5.86,5.12
10.17,6.58,5.23,,,
9.99,5.75,5.13,,,
10.96,5.84,5.71,,,
9.93,6.12,5.09,,,
9.93,6.0,5.09,,,
11.21,6.22,5.86,,,
11.28,6.1,5.9,,,
,,,10.93,6.08,5.54
Thanks in advance!

I think you need a join and fillna
print(df2.join(df1).fillna(''))
10.17 6.58 5.23 10.74 5.71 5.41
9.99 5.75 5.13 11.44 6.1 5.87
11.21 6.35 5.72
10.30 5.86 5.12
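If the two frames do not share a common index (which is what produces the staggered blocks of NaN in the concat output above), resetting the indices first should line the rows up. A minimal sketch, assuming both frames hold plain positional data and that the combined frame is then written out for the chart:
import pandas as pd

# Align both frames on a fresh 0..n-1 index before placing them side by side
df_total = pd.concat(
    [df2.reset_index(drop=True), df1.reset_index(drop=True)], axis=1
)

# Writing empty strings instead of the literal text "NaN" may help the chart
# treat missing points as absent rather than plotting them at 0
df_total.to_csv("grades_access_hs.csv", index=False, na_rep="")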

Without giving it too much thought:
import numpy as np
import pandas as pd

if df1.shape[0] > df2.shape[0]:
    new_rows = df1.shape[0] - df2.shape[0]
    df3 = pd.DataFrame(np.zeros((new_rows, df2.shape[1])))
    df2 = df2.append(df3)
# Alternative elif goes here doing the converse.
new_df = pd.concat((df1, df2), axis=1)
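A variant of the same idea that pads with NaN instead of zeros (so the padding rows do not end up plotted at y=0) could use reindex; a sketch, again assuming df1 is the longer frame, and noting that DataFrame.append is deprecated in recent pandas:
import pandas as pd

# Pad the shorter frame with NaN rows up to the length of the longer one
df2_padded = df2.reset_index(drop=True).reindex(range(len(df1)))
new_df = pd.concat([df1.reset_index(drop=True), df2_padded], axis=1)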


How to reference a dynamically created dataframe in a for loop?

I have some variables and a dictionary mapping strings to imported Google Sheets:
grad_year = '2029'
df_dict = {'grade_1': grade_1_class_2029,
           'grade_2': grade_2_class_2029,
           'grade_3': grade_3_class_2029,
           'grade_4': grade_4_class_2029,
           'grade_5': grade_5_class_2029}
I then turn the Google Sheets into dataframes, naming them dynamically:
for key, val in df_dict.items():
    rows = val.get_all_values()
    vars()["df_" + key + "_class_" + grad_year] = pd.DataFrame.from_records(
        rows[2:], columns=rows[1]
    )
Now I would like to reference them without a pre-created dictionary of their names.
There is still a bunch of stuff I would like to do to the new dataframes, such as deleting blank rows. I have tried:
for key, val in df_dict.items():
    rows = val.get_all_values()
    vars()["df_" + key + "_class_" + grad_year] = pd.DataFrame.from_records(
        rows[2:], columns=rows[1]
    )
    vars()["df_" + key + "_class_" + grad_year].replace("", nan_value, inplace=True)
    vars()["df_" + key + "_class_" + grad_year].dropna(
        subset=["Last Name"], inplace=True
    )
and
for key, val in df_dict.items():
    rows = val.get_all_values()
    vars()["df_" + key + "_class_" + grad_year] = (
        pd.DataFrame.from_records(rows[2:], columns=rows[1])
        .replace("", nan_value, inplace=True)
        .dropna(subset=["Last Name"], inplace=True)
    )
but neither worked.
If you replace nan_value with pd.NA (pandas 1.0.0 and later), your first code snippet works fine:
import pandas as pd

grad_year = "2029"
vars()[f"df_{grad_year}"] = pd.DataFrame(
    {
        "class": {
            0: "class1",
            1: "class2",
            2: "class3",
            3: "class4",
        },
        "name": {0: "John", 1: "Jack", 2: "", 3: "Butch"},
    }
)
vars()[f"df_{grad_year}"].replace("", pd.NA, inplace=True)
vars()[f"df_{grad_year}"].dropna(subset=["name"], inplace=True)
print(vars()[f"df_{grad_year}"])
# Outputs
    class   name
0  class1   John
1  class2   Jack
3  class4  Butch
In your second code snippet, you also have to set inplace to False (instead of True) in both calls so that the method chaining works:
vars()[f"df_{grad_year}"] = (
pd.DataFrame(
{
"class": {
0: "class1",
1: "class2",
2: "class3",
3: "class4",
},
"name": {0: "John", 1: "Jack", 2: "", 3: "Butch"},
}
)
.replace("", pd.NA, inplace=False)
.dropna(subset=["name"], inplace=False)
)
print(vars()[f"df_{grad_year}"])
# Output
class name
0 class1 John
1 class2 Jack
3 class4 Butch
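As a side note (a sketch rather than part of the fix above): keeping the cleaned frames in an ordinary dictionary avoids the dynamic variable names and the repeated vars() lookups entirely, while still letting you reference each frame by its constructed name:
cleaned = {}
for key, val in df_dict.items():
    rows = val.get_all_values()
    # Build, clean, and store each frame under its constructed name
    cleaned[f"df_{key}_class_{grad_year}"] = (
        pd.DataFrame.from_records(rows[2:], columns=rows[1])
        .replace("", pd.NA)
        .dropna(subset=["Last Name"])
    )
cleaned["df_grade_1_class_2029"]  # example lookup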

Return Next Pages of Results Based on Tag in API

I am trying to come up with a script that loops and returns all results from an API. The maximum number of transactions per call is 500, and there is a tag 'MoreFlag' that is 0 when a page holds 500 or fewer transactions and 1 when there are more than 500 transactions remaining. How can I write the code so that, while 'MoreFlag' is 1, it requests the next page until the tag changes to 0?
The API requires a license key and password, but here's a piece of the output.
r = 0
station_name = 'ORANGE'
usageSearchQuery = {
    'stationName': station_name,
    'startRecord': 1 + r,
    'numTransactions': 500
}
trans_data = client.service.getTransactionData(usageSearchQuery)
for c in enumerate(trans_data):
    print(c)
This returns the following:
(0, 'responseCode')
(1, 'responseText')
(2, 'transactions')
(3, 'MoreFlag')
Next, if I use this code:
for c in enumerate(trans_data.transactions):
    print(trans_data)
    # add 500 to startRecord
The API returns:
{
    'responseCode': '100',
    'responseText': 'API input request executed successfully.',
    'transactions': {
        'transactionData': [
            {
                'stationID': '1',
                'stationName': 'ORANGE',
                'transactionID': 178543,
                'Revenue': 1.38,
                'companyID': 'ABC',
                'recordNumber': 1
            },
            {
                'stationID': '1',
                'stationName': 'ORANGE',
                'transactionID': 195325,
                'Revenue': 1.63,
                'companyID': 'ABC',
                'recordNumber': 2
            },
            {
                'stationID': '1',
                'stationName': 'ORANGE',
                'transactionID': 287006,
                'Revenue': 8.05,
                'companyID': 'ABC',
                'recordNumber': 500
            }
        ]
    },
    'MoreFlag': 1
}
The idea is to pull data from trans_data.transactions.transactionData, but I'm getting tripped up when I need more than 500 results, i.e. subsequent pages.
I figured it out. I guess my only question: is there a cleaner way to do this? It seems kind of repetitive.
i = 1
y = []
lr = 0
station_name = 'ORANGE'
usageSearchQuery = {
    'stationName': station_name,
}
trans_data = client.service.getTransactionData(usageSearchQuery)
for c in enumerate(trans_data):
    while trans_data.MoreFlag == 1:
        usageSearchQuery = {
            'stationName': station_name,
            'startRecord': 1 + lr,
            'numTransactions': 500
        }
        trans_data = client.service.getTransactionData(usageSearchQuery)
        for d in trans_data.transactions.transactionData:
            td = [i, str(d.stationName), d.transactionID,
                  d.transactionTime.strftime('%Y-%m-%d %H:%M:%S'),
                  d.Revenue]
            i = i + 1
            y.append(td)
        lr = lr + len(trans_data.transactions.transactionData)
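One way to make this less repetitive (a sketch, reusing the same client.service.getTransactionData interface and field names shown above) is a single loop that keeps requesting pages until MoreFlag reports that nothing is left:
station_name = 'ORANGE'
rows = []
start_record = 1
while True:
    trans_data = client.service.getTransactionData({
        'stationName': station_name,
        'startRecord': start_record,
        'numTransactions': 500
    })
    page = trans_data.transactions.transactionData
    for d in page:
        rows.append([str(d.stationName), d.transactionID,
                     d.transactionTime.strftime('%Y-%m-%d %H:%M:%S'),
                     d.Revenue])
    if trans_data.MoreFlag != 1:
        break  # last page reached
    start_record += len(page)  # move the window to the next page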

Databricks - Pyspark - Handling nested json with a dynamic key

I have an example json data file which has the following structure:
{
    "Header": {
        "Code1": "abc",
        "Code2": "def",
        "Code3": "ghi",
        "Code4": "jkl"
    },
    "TimeSeries": {
        "2020-11-25T03:00:00+00:00": {
            "UnitPrice": 1000,
            "Amount": 10000
        },
        "2020-11-26T03:00:00+00:00": {
            "UnitPrice": 1000,
            "Amount": 10000
        }
    }
}
When I read this into Databricks with the command:
df = spark.read.json("/FileStore/test.txt")
I get two top-level objects as output: Header and TimeSeries. For TimeSeries, I want to flatten the structure so it has the following schema:
Date
UnitPrice
Amount
Because the date is a key rather than a value, I am currently only able to access it by iterating over the column names and then using them in dot notation dynamically:
def flatten_json(data):
    columnlist = data.select("TimeSeries.*")
    count = 0
    for name in data.select("TimeSeries.*"):
        df1 = data.select("Header.*").withColumn("Timeseries", lit(columnlist.columns[count])).withColumn("join", lit("a"))
        df2 = data.select("TimeSeries." + columnlist.columns[count] + ".*").withColumn("join", lit("a"))
        if count == 0:
            df3 = df1.join(df2, on=['join'], how="inner")
        else:
            df3 = df3.union(df1.join(df2, on=['join'], how="inner"))
        count = count + 1
    return df3
This is far from ideal. Does anyone know a better method to create the described dataframe?
The idea:
Step 1: Extract Header and TimeSeries separately.
Step 2: For each field in the TimeSeries object, extract the Amount and UnitPrice, together with the name of the field, stuff them into a struct.
Step 3: Merge all these structs into an array column, and explode it.
Step 4: Extract Timeseries, Amount and UnitPrice from the exploded column.
Step 5: Cross join with the Header row.
import pyspark.sql.functions as F

header_df = df.select("Header.*")
timeseries_df = df.select("TimeSeries.*")

fieldNames = enumerate(timeseries_df.schema.fieldNames())
cols = [
    F.struct(
        F.lit(name).alias("Timeseries"),
        F.col(name).getItem("Amount").alias("Amount"),
        F.col(name).getItem("UnitPrice").alias("UnitPrice"),
    ).alias("ts_" + str(idx))
    for idx, name in fieldNames
]

combined = F.explode(F.array(cols)).alias("comb")
timeseries = timeseries_df.select(combined).select('comb.Timeseries', 'comb.Amount', 'comb.UnitPrice')
result = header_df.crossJoin(timeseries)
result.show(truncate=False)
Output:
+-----+-----+-----+-----+-------------------------+------+---------+
|Code1|Code2|Code3|Code4|Timeseries |Amount|UnitPrice|
+-----+-----+-----+-----+-------------------------+------+---------+
|abc |def |ghi |jkl |2020-11-25T03:00:00+00:00|10000 |1000 |
|abc |def |ghi |jkl |2020-11-26T03:00:00+00:00|10000 |1000 |
+-----+-----+-----+-----+-------------------------+------+---------+
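If an actual date/timestamp column is wanted (as in the Date / UnitPrice / Amount schema above), the key string can be parsed afterwards. A sketch, assuming Spark can parse the ISO-8601 keys directly (otherwise pass an explicit format string to to_timestamp):
# Parse the ISO-8601 key into a proper timestamp column named "Date"
result = result.withColumn("Date", F.to_timestamp("Timeseries"))
result.select("Date", "UnitPrice", "Amount").show(truncate=False)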

How to filter an existing queryset by the month instead of day

I have a queryset that displays the number of followers for each day of the month.
Views:
context["brandtwitterFollowerCounts"] = (
TwitterFollowerCount.objects.filter(
twitter_account=self.object.twitter_account, followers__gt=0
)
.distinct("created__date")
.order_by("created__date")
)
context["brandTwitterFollowCreated"] = [
i.created.strftime("%d %m") for i in context["brandtwitterFollowerCounts"]
]
context["brandTwitterFollowers"] = [
i.followers for i in context["brandtwitterFollowerCounts"]
Dataset:
var newDataObjectTwo = {
    labels: {{ brandTwitterFollowCreated|safe }},
    datasets: [{
        label: 'Daily',
        backgroundColor: 'rgb(255, 255, 255)',
        fill: false,
        borderColor: 'rgb(29, 161, 242)',
        data: {{ brandTwitterFollowers|safe }}
    }],
}
I would now like to filter by month instead. Since there are currently three months of data (08, 09, 10), the graph should only show three data points. Sorry if this doesn't make sense; I'm not sure of the best way to explain it.
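One possible approach (a sketch, assuming created is a DateTimeField and that a single aggregated value per month, here the peak follower count, is acceptable) is to annotate with TruncMonth and group on it:
from django.db.models import Max
from django.db.models.functions import TruncMonth

# One row per month: the truncated month and the peak follower count within it
monthly = (
    TwitterFollowerCount.objects.filter(
        twitter_account=self.object.twitter_account, followers__gt=0
    )
    .annotate(month=TruncMonth("created"))
    .values("month")
    .annotate(month_followers=Max("followers"))
    .order_by("month")
)
context["brandTwitterFollowCreated"] = [m["month"].strftime("%m %Y") for m in monthly]
context["brandTwitterFollowers"] = [m["month_followers"] for m in monthly]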

How to fill column with values based on string contains condition

I have the following data:
data = {
    'employee': ['Emp1', 'Emp2', 'Emp3', 'Emp4', 'Emp5'],
    'code': ['2018_1', '2018_3', '2019_1', '2019_2', '2017_1'],
}
old_salary_bonus = 3000
new_salary_bonus = {
    '2019_1': 1000,
    '2019_2': 980,
}
df = pd.DataFrame(data)
Task: add a df['salary_bonus'] column based on the following condition:
If an employee's code contains '2019', use the code to look up the bonus in new_salary_bonus; otherwise use old_salary_bonus.
Expected Output:
employee code salary_bonus
0 Emp1 2018_1 3000
1 Emp2 2018_3 3000
2 Emp3 2019_1 1000
3 Emp4 2019_2 980
4 Emp5 2017_1 3000
Please help.
Use Series.map with Series.fillna to replace non-matched values:
import pandas as pd

data = {
    'employee': ['Emp1', 'Emp2', 'Emp3', 'Emp4', 'Emp5'],
    'code': ['2018_1', '2018_3', '2019_1', '2019_2', '2017_1'],
}
old_salary_bonus = 3000
new_salary_bonus = {
    '2019_1': 1000,
    '2019_2': 980,
}
df = pd.DataFrame(data)

df['salary_bonus'] = df['code'].map(new_salary_bonus).fillna(old_salary_bonus)
print(df)
employee code salary_bonus
0 Emp1 2018_1 3000.0
1 Emp2 2018_3 3000.0
2 Emp3 2019_1 1000.0
3 Emp4 2019_2 980.0
4 Emp5 2017_1 3000.0
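Note that map followed by fillna produces a float column, which is why the output shows 3000.0 rather than 3000. If integer bonuses are preferred, casting back should work (a minor addition to the answer above):
df['salary_bonus'] = df['code'].map(new_salary_bonus).fillna(old_salary_bonus).astype(int)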
Another solution uses dict.get with a default value for codes that are not matched:
df['salary_bonus'] = df['code'].map(lambda x: new_salary_bonus.get(x, old_salary_bonus))
You can use the code below (note that it assigns the new_salary_bonus values positionally, so it assumes the matching rows appear in the same order as the dictionary keys):
df['salary_bonus'] = old_salary_bonus
df.loc[df['code'].isin(list(new_salary_bonus)), 'salary_bonus'] = list(new_salary_bonus.values())
