I have this dataframe:
data = [{"name": "test", "sentiment":'positive', "avg":13.65, "stddev":15.24},
{"name": "test", "sentiment":'neutral', "avg":338.74, "stddev":187.27},
{"name": "test", "sentiment":'negative', "avg":54.58, "stddev":50.19}]
df = spark.createDataFrame(data).select("name", "sentiment", "avg", "stddev")
df.show()
+----+---------+------+------+
|name|sentiment| avg|stddev|
+----+---------+------+------+
|test| positive| 13.65| 15.24|
|test| neutral|338.74|187.27|
|test| negative| 54.58| 50.19|
+----+---------+------+------+
I'd like to create a dataframe with this structure:
+----+------------+-----------+------------+------------+-----------+------------+
|name|avg_positive|avg_neutral|avg_negative|std_positive|std_neutral|std_negative|
+----+------------+-----------+------------+------------+-----------+------------+
|test| 13.65| 338.74| 54.58| 15.24| 187.27| 50.19|
+----+------------+-----------+------------+------------+-----------+------------+
I also don't know the name of this operation; feel free to suggest a proper title.
Thanks!
Use groupBy() and pivot(); the reshaping you're describing (long to wide) is called a pivot.
from pyspark.sql import functions as F

df_grp = df.groupBy("name").pivot("sentiment").agg(
    F.first("avg").alias("avg"),
    F.first("stddev").alias("stddev"),
)
df_grp.show()
+----+------------+---------------+-----------+--------------+------------+---------------+
|name|negative_avg|negative_stddev|neutral_avg|neutral_stddev|positive_avg|positive_stddev|
+----+------------+---------------+-----------+--------------+------------+---------------+
|test| 54.58| 50.19| 338.74| 187.27| 13.65| 15.24|
+----+------------+---------------+-----------+--------------+------------+---------------+
Rename the columns afterwards if you want the exact layout from the question; a sketch follows.
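For example, if you want the exact column names from the question (avg_positive, std_neutral, ...), a minimal rename sketch, assuming the pivoted layout shown above:

# Rename <sentiment>_avg -> avg_<sentiment> and <sentiment>_stddev -> std_<sentiment>
renamed = df_grp
for s in ["positive", "neutral", "negative"]:
    renamed = (renamed
               .withColumnRenamed(f"{s}_avg", f"avg_{s}")
               .withColumnRenamed(f"{s}_stddev", f"std_{s}"))
renamed.show()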
I have the following pandas DataFrame:
import pandas as pd

account_number = [1234, 5678, 9012, 1234.0, 5678, 9012, 1234.0, 5678, 9012, 1234.0, 5678, 9012]
client_name = ["Ford", "GM", "Honda", "Ford", "GM", "Honda", "Ford", "GM", "Honda", "Ford", "GM", "Honda"]
database = ["DB_Ford", "DB_GM", "DB_Honda", "DB_Ford", "DB_GM", "DB_Honda", "DB_Ford", "DB_GM", "DB_Honda", "DB_Ford", "DB_GM", "DB_Honda"]
server = ["L01SQL04", "L01SQL08", "L01SQL12", "L01SQL04", "L01SQL08", "L01SQL12", "L01SQL04", "L01SQL08", "L01SQL12", "L01SQL04", "L01SQL08", "L01SQL12"]
order_num = [2145479, 2145506, 2145534, 2145603, 2145658, 2429513, 2145489, 2145516, 2145544, 2145499, 2145526, 2145554]
customer_dob = ["1967-12-01", "1963-07-09", "1986-12-05", "1967-11-01", None, "1986-12-05", "1967-12-01", "1963-07-09", "1986-12-05", "1967-12-01", "1963-07-09", "1986-12-04"]
purchase_date = ["2022-06-18", "2022-04-11", "2021-01-18", "2022-06-20", "2022-04-11", "2021-01-18", "2022-06-22", "2022-04-13", "2021-01-18", "2022-06-24", "2022-04-18", "2021-01-18"]
d = {
    "account_number": account_number,
    "client_name": client_name,
    "database": database,
    "server": server,
    "order_num": order_num,
    "customer_dob": customer_dob,
    "purchase_date": purchase_date,
}
df = pd.DataFrame(data=d)
dates = ["customer_dob", "purchase_date"]
for date in dates:
    df[date] = pd.to_datetime(df[date])
The customer's date of birth (DOB) and purchase date (PD) should be the same per account_number, but since there can be a data entry error on either one, I want to perform a groupby on the account_number and get the max on the DOB, and the min on the PD. This is easy to do if all I want are those two columns in addition to the account_number:
result = df.groupby("account_number").agg({"customer_dob": "max", "purchase_date": "min"}).reset_index()
result
However, I want the other columns as well, as they are guaranteed to be the same for each account_number. The problem is that when I attempt to include the other columns, I get multi-level columns, which I don't want. This first attempt not only produced multi-level columns, but the actual values for DOB and PD don't even appear:
result = df.groupby("account_number")["client_name", "database", "server", "order_num"].agg({"customer_dob": "max", "purchase_date": "min"}).reset_index()
result
The second attempt included the DOB and PD, but now twice for each account number, while still producing multi-level columns:
result = df.groupby("account_number")["client_name", "database", "server", "order_num", "customer_dob", "purchase_date"].agg(
    {"customer_dob": "max", "purchase_date": "min"}).reset_index()
result
I just want the end result to look like this:
So, that's my question to all you Python experts: what do I need to do to accomplish this?
Leaving the order number out, per your comment above. If the order numbers are the same for an account, add order_num to the column list in the merge.
result = df.groupby("account_number").agg({"customer_dob": "max", "purchase_date": "min"}).reset_index()
result.merge(df[['account_number', 'client_name', 'database', 'server']],
             how='left',
             on='account_number').drop_duplicates()
account_number customer_dob purchase_date client_name database server
0 1234.0 1967-12-01 2022-06-18 Ford DB_Ford L01SQL04
4 5678.0 1963-07-09 2022-04-11 GM DB_GM L01SQL08
8 9012.0 1986-12-05 2021-01-18 Honda DB_Honda L01SQL12
You can use named aggregation:
agg(new_col_1=("col_1", "sum"), new_col_2=("col_2", "min"))
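A minimal sketch of that named-aggregation approach applied to this question, assuming the non-aggregated columns really are constant per account_number (so they can simply be added to the groupby keys):

# Named aggregation avoids multi-level columns entirely; grouping on the
# constant columns alongside account_number keeps them in the result.
result = (
    df.groupby(["account_number", "client_name", "database", "server"], as_index=False)
      .agg(customer_dob=("customer_dob", "max"),
           purchase_date=("purchase_date", "min"))
)
print(result)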
I have raw data in CSV format which looks like this:
product-name brand-name rating
["Whole Wheat"] ["bb Royal"] ["4.1"]
Expected output:
product-name brand-name rating
Whole Wheat bb Royal 4.1
I want this to affect every entry in my dataset. I have 10,000 rows of data. How can I do this using pandas?
Can we do this using regular expressions? Not sure how to do it.
Thank you.
Edit 1:
My data looks something like this:
import pandas as pd

df = {
    'product-name': [['Whole Wheat'], ['Milk']],
    'brand-name': [['bb Royal'], ['XYZ']],
    'rating': [['4.1'], ['4.0']]
}
df_p = pd.DataFrame(data=df)
It outputs like this: ["bb Royal"]
PS: Apologies for my programming. I am quite new to programming and also to this community. I really appreciate your help here :)
IIUC (if I understand correctly), select the first value of each list:
df = df.apply(lambda x: x.str[0])
Or if values are strings:
df = df.replace(r'[\[\]]', '', regex=True)
You can use the explode function
df = df.apply(pd.Series.explode)
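A quick usage sketch against the df_p defined in the edit above, using the first approach (the explode version gives the same result here because each list holds a single value):

# Take the first element of each one-element list:
df_clean = df_p.apply(lambda x: x.str[0])
print(df_clean)
# row 0: Whole Wheat, bb Royal, 4.1
# row 1: Milk, XYZ, 4.0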
I am working on getting data from an API using Python. The API returns data in the form of JSON, which is normalised and written to a DataFrame, which is then written to a CSV file.
The API can return any number of columns, and this differs between records. I need only a fixed set of columns, which I define in the code.
In the scenario where a required column is not returned, my code fails.
I need a solution where, even if a required column is not present in the data, its header still gets created in the CSV and all of its rows are populated with null.
Required CSV structure:
name address phone
abc bcd 1214
bcd null null
I'm not sure if I understood you correctly, but I hope the following code solves your problem:
import json
import pandas as pd
# Declare json with missing values:
# - First element doesn't contain "phone" field
# - Second element doesn't contain "married" field
api_data = """
{ "sentences" :
[
{ "name": "abc", "address": "bcd", "married": true},
{ "name": "def", "address": "ghi", "phone" : 7687 }
]
}
"""
json_data = json.loads(api_data)
df = pd.DataFrame(
data=json_data["sentences"],
# Explicitly declare which columns should be presented in DataFrame
# If value for given column is absent it will be populated with NaN
columns=["name", "address", "married", "phone"]
)
# Save result to csv:
df.to_csv("tmp.csv", index=False)
The content of resulting csv:
name,address,married,phone
abc,bcd,True,
def,ghi,,7687.0
P.S.:
It should work even if columns are absent in all the records. Here is another example:
# Both elements do not contain "married" and "phone" fields
api_data = """
{ "sentences" :
[
{ "name": "abc", "address": "bcd"},
{ "name": "def", "address": "ghi"}
]
}
"""
json_data = json.loads(api_data)
df = pd.DataFrame(
data=json_data["sentences"],
# Explicitly declare which columns should be presented in DataFrame
# If value for given column is absent it will be populated with NaN
columns=["name", "address", "married", "phone"]
)
# Print first rows of DataFrame:
df.head()
# Expected output:
# name address married phone
# 0 abc bcd NaN NaN
# 1 def ghi NaN NaN
df.to_csv("tmp.csv", index=False)
In this case the resulting csv file will contain the following text:
name,address,married,phone
abc,bcd,,
def,ghi,,
The last two commas in the 2nd and 3rd lines mean "an empty/missing value"; if you create a DataFrame from the resulting csv with pd.read_csv, the "married" and "phone" columns will be populated with NaN values.
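A small sketch of that round trip, assuming the tmp.csv produced by the second example above:

import pandas as pd

# Reading the file back: the empty fields come in as NaN.
df_back = pd.read_csv("tmp.csv")
print(df_back["married"].isna().all())  # True, since "married" was absent from every record
print(df_back["phone"].isna().all())    # True, since "phone" was absent from every record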
I have a JSON file's data; a sample of it is given below.
[{
"Type": "Fruit",
"Names": "Apple;Orange;Papaya"
}, {
"Type": "Veggie",
"Names": "Cucumber;Spinach;Tomato"
}]
I have to read Names and match each item of Names against a column in another df.
I am stuck at converting the value of the Names key into a list that can be used in the pattern. The code I tried is:
import pandas as pd

df1 = pd.DataFrame(data)  # data is the parsed JSON list shown above
PriList=df1['Names'].str.split(";", n = 1, expand = True)
Pripat = '|'.join(r"\b{}\b".format(x) for x in PriList)
df['Match'] = df['MasterList'].str.findall('('+ Pripat + ')').str.join(', ')
The issue is with Pripat. Its content is
\bApple, Orange\b
If I give the Names in a list like the one below
Prilist=['Apple','Orange','Papaya']
the code works fine.
Please help.
You'll need to call str.split and then flatten the result using itertools.chain.
First, do
df2 = df1.loc[df1.Type.eq('Fruit')]
Now,
from itertools import chain
prilist = list(chain.from_iterable(df2.Names.str.split(';').values))
There's also stack (which is slower):
prilist = df2.Names.str.split(';', expand=True).stack().tolist()
print(prilist)
['Apple', 'Orange', 'Papaya']
df2 = df1.loc[df1.Type.eq('Fruit')]
out_list = ';'.join(df2['Names'].values).split(';')
print(out_list)
# ['Apple', 'Orange', 'Papaya']
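With the flat list in hand, the matching code from the question should work as intended; a hedged sketch (it assumes df has the MasterList column mentioned in the question):

# Build the alternation pattern from the flat list and apply it to MasterList.
pripat = '|'.join(r"\b{}\b".format(x) for x in out_list)
df['Match'] = df['MasterList'].str.findall('(' + pripat + ')').str.join(', ')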
I have some JSON, returned from an API call, that looks something like this:
{
"result": {
"code": "OK",
"msg": ""
},
"report_name": "FAMOUS_DICTATORS",
"columns": [
"rank",
"name",
"deaths"
],
"data": [
{
"row": [
1,
"Mao Zedong",
63000000
]
},
{
"row": [
2,
"Jozef Stalin",
23000000
]
}
]
}
I'd like to convert the JSON into a Pandas DataFrame:
rank name deaths
1 Mao Zedong 63000000
2 Jozef Stalin 23000000
I wrote this and it works, but looks a bit ugly:
import pandas as pd
import json
columns = eval(r.content)['columns']
df = pd.DataFrame(columns = eval(r.content)['columns'])
for row in eval(r.content)['data']:
df.loc[len(df)+1] = row['row']
Is there a more elegant/Pythonic way to do this (e.g. possibly using pandas.io.json.read_json)?
The read_json function of pandas is a tricky method to use. If you don't know with certainty that your JSON is valid, or that its structure maps cleanly onto a DataFrame, it's much better to stick to tried and tested methods for breaking your data down into something pandas can consume reliably.
In your case, I suggest breaking the data down into a list of lists. Out of all that JSON, the only parts you really need are under the "data" and "columns" keys.
Try this:
import pandas as pd
import json

# Load the JSON (from a local file here; with requests you could use r.json() instead)
with open("test.json") as f:
    js = json.load(f)

data = js["data"]
rows = [row["row"] for row in data]  # Collect each 'row' list into a list of lists.
df = pd.DataFrame(rows, columns=js["columns"])
print(df)
This gives me the desired result:
rank name deaths
0 1 Mao Zedong 63000000
1 2 Jozef Stalin 23000000
see pandas.io.json.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None)
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.json.read_json.html
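For what it's worth, a minimal sketch of read_json on flat, record-oriented JSON (it would not untangle the nested "data"/"row" structure above by itself, but it shows the basic call; the sample string is made up for illustration):

import io
import pandas as pd

# read_json parses record-oriented JSON; wrapping the string in StringIO keeps
# newer pandas versions happy.
simple = '[{"rank": 1, "name": "Mao Zedong"}, {"rank": 2, "name": "Jozef Stalin"}]'
df_simple = pd.read_json(io.StringIO(simple), orient="records")
print(df_simple)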