Uniting two data frames - python
I want to merge these two data frames (which have no common columns) side by side. The two dataframes look like this:
df1:
10.74,5.71,5.41
11.44,6.1,5.87
df2:
10.17,6.58,5.23
9.99,5.75,5.13
11.21,6.35,5.72
10.3,5.86,5.12
I have tried:
df_total = pd.concat([df1, df2], axis=1)
But the result looks something like this:
Access grade global,Grade_global,Regression_global,Access grade,Grade,Regression
,,,10.74,5.71,5.41
,,,11.44,6.1,5.87
10.17,6.58,5.23,,,
9.99,5.75,5.13,,,
11.21,6.35,5.72,,,
10.3,5.86,5.12,,,
And I want to have something like this:
10.17,6.58,5.23,10.74,5.71,5.41
9.99,5.75,5.13,11.44,6.1,5.87
11.21,6.35,5.72
10.3,5.86,5.12
The two things I want to know how to do are:
1- How can I merge the two data frames so that the values sit next to each other? (The number of rows should therefore be the maximum number of rows of the two dataframes; 4 in this case.)
2- How can I avoid having NaN? (You can see that at the end of each row there are multiple commas.) I want to avoid this because in the scatter plot I draw afterwards, all the NaN values are plotted as 0, so I get a line of dots at y=0.
The NaN values are generating zeros in the resulting chart. The HTML snippet is:
<div style="line-height:77%;"><br></div>
<div id="grade_access_hs"></div>
<div style="line-height:77%;"><br></div>
<p>The lines that best approximate the expected grades according to the access grade to University and comparing all students with {{user.hsname}}' students are:</p>
<div style="line-height:30%;"><br></div>
<div id="equation3"></div>
<div style="line-height:30%;"><br></div>
<div id="equation4"></div>
<script type="text/javascript" src="../static/scripts/grade_access_hs.js"></script>
The full chart script is:
<script>
'use strict';
var Grade_access_hs = c3.generate({
    bindto: '#grade_access_hs',
    data: {
        url: '../static/CSV/Chart_data/grades_access_hs.csv',
        xs: {
            Grade_global: 'Access grade global',
            Grade: 'Access grade',
            Regression_global: 'Access grade global',
            Regression: 'Access grade'
        },
        types: {
            Grade_global: 'scatter',
            Grade: 'scatter',
            Regression_global: 'line',
            Regression: 'line'
        },
    },
    axis: {
        y: {
            label: {
                text: "Average grade",
                position: "outer-middle"
            },
            min: 1,
            max: 9,
            tick: {outer: false}
        },
        x: {
            label: {
                text: "Access grade PAU",
                position: "outer-center"
            },
            min: 9,
            max: 14,
            tick: {
                outer: false,
                count: 1,
                fit: false,
                values: [9, 10, 11, 12, 13, 14]
            }
        }
    },
    size: {
        height: 400,
        width: 800
    },
    zoom: {
        enabled: true
    },
    legend: {
        show: true,
        position: 'inset',
        inset: {
            anchor: 'top-right',
            x: 20,
            y: 20
        }
    },
})
d3.csv('../static/CSV/Chart_data/grades_access_hs.csv', function(data) {
    var d1 = data[0];
    var d2 = data[1];
    var b = (1 - (d2['Regression_global'] / d1['Regression_global'])) / ((d1['Access grade global'] - d2['Access grade global']) / d1['Regression_global']);
    var a = d1['Regression_global'] - (b * d1['Access grade global']);
    b = (Math.round(b * 1000) / 1000);
    a = (Math.round(a * 1000) / 1000);
    document.getElementById("equation3").innerHTML = "Global: Grade = " + a + "·x + " + b;
    var d = (1 - (d2['Regression'] / d1['Regression'])) / ((d1['Access grade private'] - d2['Access grade private']) / d1['Regression']);
    var c = d1['Regression'] - (b * d1['Access grade']);
    d = (Math.round(d * 1000) / 1000);
    c = (Math.round(c * 1000) / 1000);
    document.getElementById("equation4").innerHTML = "Specific high school: Grade = " + c + "·x + " + d;
})
</script>
With grades_access_hs.csv:
Access grade global,Grade_global,Regression_global,Access grade,Grade,Regression
,,,10.74,5.71,5.41
,,,11.44,6.1,5.87
,,,11.21,6.35,5.72
,,,10.3,5.86,5.12
10.17,6.58,5.23,,,
9.99,5.75,5.13,,,
10.96,5.84,5.71,,,
9.93,6.12,5.09,,,
9.93,6.0,5.09,,,
11.21,6.22,5.86,,,
11.28,6.1,5.9,,,
,,,10.93,6.08,5.54
Thanks in advance!
I think you need a join and fillna
print(df2.join(df1).fillna(''))
10.17 6.58 5.23 10.74 5.71 5.41
9.99 5.75 5.13 11.44 6.1 5.87
11.21 6.35 5.72
10.30 5.86 5.12
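If the two frames were read from different files their indices may not line up, in which case join produces extra NaN rows. A minimal, self-contained sketch of the same idea (the data and column names are taken from the question; resetting the index is an extra assumption for safety):

import pandas as pd

# Hypothetical reconstruction of the two frames from the question.
df1 = pd.DataFrame([[10.74, 5.71, 5.41], [11.44, 6.1, 5.87]],
                   columns=["Access grade", "Grade", "Regression"])
df2 = pd.DataFrame([[10.17, 6.58, 5.23], [9.99, 5.75, 5.13],
                    [11.21, 6.35, 5.72], [10.3, 5.86, 5.12]],
                   columns=["Access grade global", "Grade_global", "Regression_global"])

# join aligns on the index, so reset both to 0..n-1 first; the shorter frame is padded with NaN.
df_total = df2.reset_index(drop=True).join(df1.reset_index(drop=True))
print(df_total.fillna(""))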
Without giving it too much thought:
import numpy as np
import pandas as pd

if df1.shape[0] > df2.shape[0]:
    new_rows = df1.shape[0] - df2.shape[0]
    df3 = pd.DataFrame(np.zeros((new_rows, df2.shape[1])))
    df2 = df2.append(df3)

new_df = pd.concat((df1, df2), axis=1)
# Alternative elif goes here doing the converse.
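One caveat with the padding approach above, relating to question 2: np.zeros fills the extra rows with literal 0.0 values, which is exactly what ends up plotted at y=0. A sketch of the same idea padding with NaN instead, assuming the goal is empty cells in the written CSV (df1, df2 and the output path are taken from the question):

import numpy as np
import pandas as pd

if df1.shape[0] > df2.shape[0]:
    # Pad the shorter frame with NaN rows instead of zeros.
    pad = pd.DataFrame(np.full((df1.shape[0] - df2.shape[0], df2.shape[1]), np.nan),
                       columns=df2.columns)
    df2 = pd.concat([df2, pad], ignore_index=True)

new_df = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
# NaN cells are written as empty fields, so the chart receives blanks rather than zeros.
new_df.to_csv("../static/CSV/Chart_data/grades_access_hs.csv", index=False)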
Related
How to reference a dynamically created dataframe in a for loop?
I have some variables and a dictionary of strings and google sheets imported:

grad_year = '2029'
df_dict = {
    'grade_1': grade_1_class_2029,
    'grade_2': grade_2_class_2029,
    'grade_3': grade_3_class_2029,
    'grade_4': grade_4_class_2029,
    'grade_5': grade_5_class_2029,
}

I then turn the google sheets into dataframes, naming them dynamically:

for key, val in df_dict.items():
    rows = val.get_all_values()
    vars()["df_" + key + "_class_" + grad_year] = pd.DataFrame.from_records(
        rows[2:], columns=rows[1]
    )

Now I would like to reference them without a pre-created dictionary of their names. There is still a bunch of stuff I would like to do to the new dataframes, such as deleting blank rows. I have tried:

for key, val in df_dict.items():
    rows = val.get_all_values()
    vars()["df_" + key + "_class_" + grad_year] = pd.DataFrame.from_records(
        rows[2:], columns=rows[1]
    )
    vars()["df_" + key + "_class_" + grad_year].replace("", nan_value, inplace=True)
    vars()["df_" + key + "_class_" + grad_year].dropna(
        subset=["Last Name"], inplace=True
    )

and

for key, val in df_dict.items():
    rows = val.get_all_values()
    vars()["df_" + key + "_class_" + grad_year] = (
        pd.DataFrame.from_records(rows[2:], columns=rows[1])
        .replace("", nan_value, inplace=True)
        .dropna(subset=["Last Name"], inplace=True)
    )

but neither worked.
If you replace nan_value by pd.NA (pandas 1.0.0 and beyond), your first code snippet works fine:

import pandas as pd

grad_year = "2029"

vars()[f"df_{grad_year}"] = pd.DataFrame(
    {
        "class": {0: "class1", 1: "class2", 2: "class3", 3: "class4"},
        "name": {0: "John", 1: "Jack", 2: "", 3: "Butch"},
    }
)
vars()[f"df_{grad_year}"].replace("", pd.NA, inplace=True)
vars()[f"df_{grad_year}"].dropna(subset=["name"], inplace=True)

print(vars()[f"df_{grad_year}"])
# Output
    class   name
0  class1   John
1  class2   Jack
3  class4  Butch

In your second code snippet, you also have to set inplace to False instead of True both times in order for chained assignments to work:

vars()[f"df_{grad_year}"] = (
    pd.DataFrame(
        {
            "class": {0: "class1", 1: "class2", 2: "class3", 3: "class4"},
            "name": {0: "John", 1: "Jack", 2: "", 3: "Butch"},
        }
    )
    .replace("", pd.NA, inplace=False)
    .dropna(subset=["name"], inplace=False)
)

print(vars()[f"df_{grad_year}"])
# Output
    class   name
0  class1   John
1  class2   Jack
3  class4  Butch
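Carried back into the loop from the question, a sketch could look like this (df_dict, get_all_values and the 'Last Name' column come from the question; pd.NA needs pandas 1.0+):

import pandas as pd

for key, val in df_dict.items():
    rows = val.get_all_values()
    name = f"df_{key}_class_{grad_year}"
    vars()[name] = (
        pd.DataFrame.from_records(rows[2:], columns=rows[1])
        .replace("", pd.NA)          # inplace left at the default False so the chain continues
        .dropna(subset=["Last Name"])
    )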
Return Next Pages of Results Based on Tag in API
I am trying to come up with a script that loops and returns all results from an API. The max transactions per call is 500, and there is a tag 'MoreFlag' that is 0 when there are 500 or fewer transactions and 1 when there are more than 500 transactions (per page). How can I write the code so that when 'MoreFlag' is 1 it goes to the next page, until the tag changes to 0? The API requires a license key and password, but here's a piece of the output.

r = 0
station_name = 'ORANGE'
usageSearchQuery = {
    'stationName': station_name,
    'startRecord': 1 + r,
    'numTransactions': 500
}
trans_data = client.service.getTransactionData(usageSearchQuery)
for c in enumerate(trans_data):
    print(c)

This returns the following:

(0, 'responseCode')
(1, 'responseText')
(2, 'transactions')
(3, 'MoreFlag')

Next, if I use this code:

for c in enumerate(trans_data.transactions):
    print(trans_data)
    # add 500 to startRecord

The API returns:

{
    'responseCode': '100',
    'responseText': 'API input request executed successfully.',
    'transactions': {
        'transactionData': [
            {
                'stationID': '1',
                'stationName': 'ORANGE',
                'transactionID': 178543,
                'Revenue': 1.38,
                'companyID': 'ABC',
                'recordNumber': 1
            },
            {
                'stationID': '1',
                'stationName': 'ORANGE',
                'transactionID': 195325,
                'Revenue': 1.63,
                'companyID': 'ABC',
                'recordNumber': 2
            },
            {
                'stationID': '1',
                'stationName': 'ORANGE',
                'transactionID': 287006,
                'Revenue': 8.05,
                'companyID': 'ABC',
                'recordNumber': 500
            }
        ]
    },
    'MoreFlag': 1
}

The idea is to pull data from trans_data.transactions.transactionData, but I'm getting tripped up when I need more than 500 results, i.e. subsequent pages.
I figured it out. I guess my only question: is there a cleaner way to do this? It seems kind of repetitive.

i = 1
y = []
lr = 0
station_name = 'ORANGE'
usageSearchQuery = {
    'stationName': station_name,
}
trans_data = client.service.getTransactionData(usageSearchQuery)
for c in enumerate(trans_data):
    while trans_data.MoreFlag == 1:
        usageSearchQuery = {
            'stationName': station_name,
            'startRecord': 1 + lr,
            'numTransactions': 500
        }
        trans_data = client.service.getTransactionData(usageSearchQuery)
        for d in trans_data.transactions.transactionData:
            td = [i, str(d.stationName), d.transactionID,
                  d.transactionTime.strftime('%Y-%m-%d %H:%M:%S'), d.Revenue]
            i = i + 1
            y.append(td)
        lr = lr + len(trans_data.transactions.transactionData)
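Since the answer itself asks whether there is a cleaner way, one possible refactor is a single while loop that keeps fetching pages until MoreFlag drops to 0. This is only a sketch under the same assumptions as the code above (the same client, field names and 500-record page size):

station_name = 'ORANGE'
rows = []
start = 1

while True:
    query = {
        'stationName': station_name,
        'startRecord': start,
        'numTransactions': 500,
    }
    page = client.service.getTransactionData(query)
    batch = page.transactions.transactionData
    for d in batch:
        rows.append([str(d.stationName), d.transactionID,
                     d.transactionTime.strftime('%Y-%m-%d %H:%M:%S'), d.Revenue])
    if page.MoreFlag != 1:
        break  # last page reached
    start += len(batch)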
Databricks - Pyspark - Handling nested json with a dynamic key
I have an example json data file which has the following structure:

{
    "Header": {
        "Code1": "abc",
        "Code2": "def",
        "Code3": "ghi",
        "Code4": "jkl"
    },
    "TimeSeries": {
        "2020-11-25T03:00:00+00:00": {
            "UnitPrice": 1000,
            "Amount": 10000
        },
        "2020-11-26T03:00:00+00:00": {
            "UnitPrice": 1000,
            "Amount": 10000
        }
    }
}

When I parse this into Databricks with the command:

df = spark.read.json("/FileStore/test.txt")

I get as output 2 objects: Header and TimeSeries. With the TimeSeries I want to be able to flatten the structure so it has the following schema:

Date, UnitPrice, Amount

As the date field is a key, I am currently only able to access it by iterating through the column names and then using them in the dot notation dynamically:

def flatten_json(data):
    columnlist = data.select("TimeSeries.*")
    count = 0
    for name in data.select("TimeSeries.*"):
        df1 = data.select("Header.*").withColumn("Timeseries", lit(columnlist.columns[count])).withColumn("join", lit("a"))
        df2 = data.select("TimeSeries." + columnlist.columns[count] + ".*").withColumn("join", lit("a"))
        if count == 0:
            df3 = df1.join(df2, on=['join'], how="inner")
        else:
            df3 = df3.union(df1.join(df2, on=['join'], how="inner"))
        count = count + 1
    return df3

This is far from ideal. Does anyone know a better method to create the described dataframe?
The idea:
Step 1: Extract Header and TimeSeries separately.
Step 2: For each field in the TimeSeries object, extract the Amount and UnitPrice, together with the name of the field, and stuff them into a struct.
Step 3: Merge all these structs into an array column, and explode it.
Step 4: Extract Timeseries, Amount and UnitPrice from the exploded column.
Step 5: Cross join with the Header row.

import pyspark.sql.functions as F

header_df = df.select("Header.*")
timeseries_df = df.select("TimeSeries.*")

fieldNames = enumerate(timeseries_df.schema.fieldNames())
cols = [F.struct(F.lit(name).alias("Timeseries"),
                 F.col(name).getItem("Amount").alias("Amount"),
                 F.col(name).getItem("UnitPrice").alias("UnitPrice")).alias("ts_" + str(idx))
        for idx, name in fieldNames]

combined = F.explode(F.array(cols)).alias("comb")
timeseries = timeseries_df.select(combined).select('comb.Timeseries', 'comb.Amount', 'comb.UnitPrice')

result = header_df.crossJoin(timeseries)
result.show(truncate=False)

Output:

+-----+-----+-----+-----+-------------------------+------+---------+
|Code1|Code2|Code3|Code4|Timeseries               |Amount|UnitPrice|
+-----+-----+-----+-----+-------------------------+------+---------+
|abc  |def  |ghi  |jkl  |2020-11-25T03:00:00+00:00|10000 |1000     |
|abc  |def  |ghi  |jkl  |2020-11-26T03:00:00+00:00|10000 |1000     |
+-----+-----+-----+-----+-------------------------+------+---------+
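If the final schema really needs a Date column instead of the raw TimeSeries key, a possible follow-up is sketched below; whether the ISO-8601 string with the +00:00 offset parses directly via a timestamp cast depends on the Spark version, so an explicit to_timestamp format may be needed.

# Hypothetical follow-up: derive a proper date column from the Timeseries key string.
result = result.withColumn("Date", F.to_date(F.col("Timeseries").cast("timestamp")))
result.select("Date", "UnitPrice", "Amount").show()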
How to filter an existing queryset by the month instead of day
I have a queryset that displays the amount of followers for each day of the month.

Views:

context["brandtwitterFollowerCounts"] = (
    TwitterFollowerCount.objects.filter(
        twitter_account=self.object.twitter_account, followers__gt=0
    )
    .distinct("created__date")
    .order_by("created__date")
)
context["brandTwitterFollowCreated"] = [
    i.created.strftime("%d %m") for i in context["brandtwitterFollowerCounts"]
]
context["brandTwitterFollowers"] = [
    i.followers for i in context["brandtwitterFollowerCounts"]
]

Dataset:

var newDataObjectTwo = {
    labels: {{ brandTwitterFollowCreated|safe }},
    datasets: [{
        label: 'Daily',
        backgroundColor: 'rgb(255, 255, 255)',
        fill: false,
        borderColor: 'rgb(29, 161, 242)',
        data: {{ brandTwitterFollowers|safe }}
    }],
}

I would like to filter this by month instead now. So, as there are currently 3 months (08, 09, 10), the graph should only show 3 data points. Sorry if this doesn't make sense; I am not sure what the best way to explain it is.
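One possible direction, not a verified solution: group by calendar month with TruncMonth and aggregate one value per month. Taking the maximum follower count within each month is an assumption about the desired semantics, and the label format is illustrative.

from django.db.models import Max
from django.db.models.functions import TruncMonth

monthly = (
    TwitterFollowerCount.objects
    .filter(twitter_account=self.object.twitter_account, followers__gt=0)
    .annotate(month=TruncMonth("created"))
    .values("month")                       # group by calendar month
    .annotate(followers=Max("followers"))  # one aggregated value per month (assumption)
    .order_by("month")
)
context["brandTwitterFollowCreated"] = [e["month"].strftime("%m %Y") for e in monthly]
context["brandTwitterFollowers"] = [e["followers"] for e in monthly]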
How to fill column with values based on string contains condition
I have the following data:

data = {
    'employee': ['Emp1', 'Emp2', 'Emp3', 'Emp4', 'Emp5'],
    'code': ['2018_1', '2018_3', '2019_1', '2019_2', '2017_1'],
}

old_salary_bonus = 3000
new_salary_bonus = {
    '2019_1': 1000,
    '2019_2': 980,
}

df = pd.DataFrame(data)

Task: Add a df['salary_bonus'] column based on the following condition: if an employee's code contains '2019', use the code value to retrieve the salary bonus from new_salary_bonus; otherwise use the old_salary_bonus value.

Expected output:

  employee    code  salary_bonus
0     Emp1  2018_1          3000
1     Emp2  2018_3          3000
2     Emp3  2019_1          1000
3     Emp4  2019_2           980
4     Emp5  2017_1          3000

Please help.
Use Series.map with Series.fillna to replace non-matched values:

import pandas as pd

data = {
    'employee': ['Emp1', 'Emp2', 'Emp3', 'Emp4', 'Emp5'],
    'code': ['2018_1', '2018_3', '2019_1', '2019_2', '2017_1'],
}
old_salary_bonus = 3000
new_salary_bonus = {
    '2019_1': 1000,
    '2019_2': 980,
}
df = pd.DataFrame(data)

df['salary_bonus'] = df['code'].map(new_salary_bonus).fillna(old_salary_bonus)
print(df)

  employee    code  salary_bonus
0     Emp1  2018_1        3000.0
1     Emp2  2018_3        3000.0
2     Emp3  2019_1        1000.0
3     Emp4  2019_2         980.0
4     Emp5  2017_1        3000.0

Another solution uses dict.get with a default value for codes that do not match:

df['salary_bonus'] = df['code'].map(lambda x: new_salary_bonus.get(x, old_salary_bonus))
You can use the code below:

df['salary_bonus'] = old_salary_bonus
df.loc[df['code'].isin(list(new_salary_bonus)), 'salary_bonus'] = list(new_salary_bonus.values())
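The question title asks for a "contains" condition rather than an exact lookup, so a hedged variant that literally tests for the '2019' substring could look like the sketch below; it assumes every code containing '2019' has an entry in new_salary_bonus, as in the sample data.

import numpy as np

# Where the code contains '2019', look the bonus up in new_salary_bonus;
# everywhere else fall back to old_salary_bonus.
df['salary_bonus'] = np.where(df['code'].str.contains('2019'),
                              df['code'].map(new_salary_bonus),
                              old_salary_bonus)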