I have a JSON response (sample below) that I'm trying to convert into a DataFrame. I've had several issues with the data being listed as columns (1 x 346), etc. I only need the 5 columns listed below:
area_name,
date,
month,
unemployment_rate,
year
Here's my code:
edd_ca_df = pd.DataFrame.from_dict(edd_ca, orient="index",
columns=["area_name", "month", "date", "year", "unemployment_rate"])
and here's a sample of the JSON response:
[[{'area_name': 'California',
'area_type': 'State',
'date': '1990-01-01T00:00:00.000',
'employment': '14099700',
'labor_force': '14953900',
'month': 'January',
'seasonally_adjusted_y_n': 'N',
'status_preliminary_final': 'Final',
'unemployment': '854200',
'unemployment_rate': '5.7',
'year': '1990'},
{'area_name': 'California',
'area_type': 'State',
'date': '1990-02-01T00:00:00.000',
'employment': '14206700',
'labor_force': '15049400',
'month': 'February',
'seasonally_adjusted_y_n': 'N',
'status_preliminary_final': 'Final',
'unemployment': '842800',
'unemployment_rate': '5.6',
'year': '1990'},
Any help would be greatly appreciated.
Since you have a list of dictionaries, this is as simple as passing all the data to a new DataFrame and specifying what columns you want to keep:
import pandas as pd
all_data = [{'area_name': 'California',
'area_type': 'State',
'date': '1990-01-01T00:00:00.000',
'employment': '14099700',
'labor_force': '14953900',
'month': 'January',
'seasonally_adjusted_y_n': 'N',
'status_preliminary_final': 'Final',
'unemployment': '854200',
'unemployment_rate': '5.7',
'year': '1990'},
{'area_name': 'California',
'area_type': 'State',
'date': '1990-02-01T00:00:00.000',
'employment': '14206700',
'labor_force': '15049400',
'month': 'February',
'seasonally_adjusted_y_n': 'N',
'status_preliminary_final': 'Final',
'unemployment': '842800',
'unemployment_rate': '5.6',
'year': '1990'}]
keep_columns = ['area_name','date','month','unemployment_rate','year']
df = pd.DataFrame(columns=keep_columns, data=all_data)
print(df)
Output
area_name date month unemployment_rate year
0 California 1990-01-01T00:00:00.000 January 5.7 1990
1 California 1990-02-01T00:00:00.000 February 5.6 1990
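Note that the sample in the question opens with `[[`, which suggests the records may be wrapped in an outer list. A minimal sketch for that case follows, assuming the records sit at index 0 of `edd_ca` (the trimmed `edd_ca` literal here is illustrative, not the real response). Every value in the sample is a string, so the sketch also converts the columns likely to be used numerically.

```python
import pandas as pd

# Hypothetical nested response: an outer list whose first element holds the records.
edd_ca = [[
    {'area_name': 'California', 'area_type': 'State',
     'date': '1990-01-01T00:00:00.000', 'month': 'January',
     'unemployment_rate': '5.7', 'year': '1990'},
    {'area_name': 'California', 'area_type': 'State',
     'date': '1990-02-01T00:00:00.000', 'month': 'February',
     'unemployment_rate': '5.6', 'year': '1990'},
]]

keep_columns = ['area_name', 'date', 'month', 'unemployment_rate', 'year']

# Unwrap the outer list only if the first element is itself a list.
records = edd_ca[0] if isinstance(edd_ca[0], list) else edd_ca
df = pd.DataFrame(records, columns=keep_columns)

# The sample values are all strings; convert the ones needed as dates or numbers.
df['date'] = pd.to_datetime(df['date'])
df['unemployment_rate'] = df['unemployment_rate'].astype(float)
df['year'] = df['year'].astype(int)
```

Passing the nested list directly would give pandas one row containing many dictionaries, which may explain the 1 x 346 shape mentioned in the question.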
data = {'gems': [{'name': 'garnet', 'colour': 'red', 'month': 'January'},
{'name': 'emerald', 'colour': 'green', 'month': 'May'},
{'name': "cat's eye", 'colour': 'yellow', 'month': 'June'},
{'name': 'sardonyx', 'colour': 'red', 'month': 'August'},
{'name': 'peridot', 'colour': 'green', 'month': 'September'},
{'name': 'ruby', 'colour': 'red', 'month': 'December'}]}
How do I create a list of colours and then just find the months with the colour red?
I've tried for and if, but I keep getting the error message
string indices must be integers
Because data['gems'] is a list of dictionaries, you can use a list comprehension with an if condition to keep only the entries you want:
[x['month'] for x in data['gems'] if x['colour'] == 'red']
Returns:
['January', 'August', 'December']
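The "string indices must be integers" error most likely comes from looping over `data` itself, which yields its keys (here the string 'gems') rather than the gem dictionaries. A short sketch of the first half of the question, building the list of colours, under that assumption (the variable names `colours`, `unique_colours` and `red_months` are illustrative):

```python
data = {'gems': [{'name': 'garnet', 'colour': 'red', 'month': 'January'},
                 {'name': 'emerald', 'colour': 'green', 'month': 'May'},
                 {'name': "cat's eye", 'colour': 'yellow', 'month': 'June'},
                 {'name': 'sardonyx', 'colour': 'red', 'month': 'August'},
                 {'name': 'peridot', 'colour': 'green', 'month': 'September'},
                 {'name': 'ruby', 'colour': 'red', 'month': 'December'}]}

# Iterate over the list stored under 'gems', not over `data`:
# `for x in data` yields the key 'gems', and 'gems'['colour'] raises TypeError.
colours = [gem['colour'] for gem in data['gems']]

# Unique colours in first-seen order (dict keys preserve insertion order).
unique_colours = list(dict.fromkeys(colours))

# Months whose gem is red.
red_months = [gem['month'] for gem in data['gems'] if gem['colour'] == 'red']
```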
If you want the result as a pandas object, you can use pandas.json_normalize and pandas.DataFrame.query as follows (selecting ['month'] yields a Series, despite the variable name df):
df = pd.json_normalize(data['gems']).query('colour == "red"')['month']
[Out]:
0 January
3 August
5 December
If you want the index reset, chain pandas.Series.reset_index with drop=True:
df = pd.json_normalize(data['gems']).query('colour == "red"')['month'].reset_index(drop=True)
[Out]:
0 January
1 August
2 December
I have the following DataFrame:
a = [{'order': '789', 'name': 'A', 'date': 20220501, 'sum': 15.1}, {'order': '456', 'name': 'A', 'date': 20220501, 'sum': 19}, {'order': '704', 'name': 'B', 'date': 20220502, 'sum': 14.1}, {'order': '704', 'name': 'B', 'date': 20220502, 'sum': 22.9}, {'order': '700', 'name': 'B', 'date': 20220502, 'sum': 30.1}, {'order': '710', 'name': 'B', 'date': 20220502, 'sum': 10.5}]
df = pd.DataFrame(a)
print(df)
I need to group by the columns name and date, sum the values in column sum, and count the distinct values in column order, putting that count in a new column order_count.
I need to get the following result:
In your case do
out = df.groupby(['name','date'],as_index=False).agg({'sum':'sum','order':'nunique'})
Out[652]:
name date sum order
0 A 20220501 34.1 2
1 B 20220502 77.6 3
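The output above keeps the column name order, whereas the question asks for order_count. One way to get that name directly is pandas named aggregation; a minimal sketch using the question's data:

```python
import pandas as pd

a = [{'order': '789', 'name': 'A', 'date': 20220501, 'sum': 15.1},
     {'order': '456', 'name': 'A', 'date': 20220501, 'sum': 19},
     {'order': '704', 'name': 'B', 'date': 20220502, 'sum': 14.1},
     {'order': '704', 'name': 'B', 'date': 20220502, 'sum': 22.9},
     {'order': '700', 'name': 'B', 'date': 20220502, 'sum': 30.1},
     {'order': '710', 'name': 'B', 'date': 20220502, 'sum': 10.5}]
df = pd.DataFrame(a)

# Named aggregation: output_column=(input_column, aggregation function).
out = (
    df.groupby(['name', 'date'], as_index=False)
      .agg(sum=('sum', 'sum'), order_count=('order', 'nunique'))
)
```

The resulting columns are name, date, sum and order_count, so no separate rename step is needed.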
import pandas as pd
# Sum 'sum' per (name, date), then join the number of distinct orders as 'order_count'
(df.groupby(['name','date'])['sum'].sum().reset_index()
 .join(df.groupby(['name','date'])['order'].nunique().rename('order_count').reset_index(drop=True)))
I'm doing a simple groupby on my data, as shown in the code below. Is there a way to do it directly, in a single line of code, without the drop_duplicates call?
Thank you
df_brut['Revenue'] = df_brut.groupby(['cod', 'date', 'zone'])['Revenue'].transform('sum')
df_brut = df_brut.drop_duplicates()
df_brut.columns = ['cod','date', 'zone','SUM_']
My data
data1 = {'date': ['2021-06', '2021-06', '2021-07', '2021-07', '2021-07', '2021-07'], 'cod': ['12', '12', '14', '15', '15', '18'], 'zone': ['LA', 'LA', 'LA', 'PARIS', 'PARIS', 'PARIS'], 'Revenue': [10, 20, 30, 50, 40, 10]}
df_brut= pd.DataFrame(data1)
The expected grouped data is:
data2 = {'date': ['2021-06', '2021-07', '2021-07', '2021-07'], 'cod': ['12', '14', '15','18'], 'zone': ['LA', 'LA', 'PARIS', 'PARIS'], 'SUM_': [30, 30, 90, 10]}
df_grouped= pd.DataFrame(data2)
You could do:
(df_brut.groupby(['cod', 'date', 'zone'], as_index=False)['Revenue']
.sum()
.rename({'Revenue': 'SUM_'}, axis=1)
)
Following dataframe df is given:
df = pd.DataFrame({'ISIN': ['Cash', 'CH0038863350', 'DE0007164600'],
'Country': ['United States', 'Switzerland', 'Germany'], 'Category': ['A', 'B', 'C']})
If the value of ISIN is 'Cash', the value of 'Category' should be changed to 'Cash'. This means df will become
df = pd.DataFrame({'ISIN': ['Cash', 'CH0038863350', 'DE0007164600'],
'Country': ['United States', 'Switzerland', 'Germany'], 'Category': ['Cash', 'B', 'C']})
How to do this?
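No answer is recorded for this question. One common approach, offered as a sketch rather than the only option, is a boolean mask with DataFrame.loc:

```python
import pandas as pd

df = pd.DataFrame({'ISIN': ['Cash', 'CH0038863350', 'DE0007164600'],
                   'Country': ['United States', 'Switzerland', 'Germany'],
                   'Category': ['A', 'B', 'C']})

# The mask selects rows where ISIN equals 'Cash'; .loc assigns to
# the 'Category' column of just those rows, modifying df in place.
df.loc[df['ISIN'] == 'Cash', 'Category'] = 'Cash'
```

Rows whose ISIN is anything else keep their existing Category.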
I want to find the difference between two data frames in terms of column names.
This is sample table1
d1 = {'row_num': [1, 2, 3, 4, 5], 'name': ['john', 'tom', 'bob', 'rock', 'jimy'], 'DoB': ['01/02/2010', '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
df1 = pd.DataFrame(data=d1)
df1['month'] = pd.DatetimeIndex(df1['DoB']).month
df1['year'] = pd.DatetimeIndex(df1['DoB']).year
This is sample table2
d2 = {'row_num': [1, 2, 3, 4, 5], 'name': ['john', 'tom', 'bob', 'rock', 'jimy'], 'DoB': ['01/02/2010', '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
df2 = pd.DataFrame(data=d2)
Table 2 (df2) does not have the month and year columns that df1 has. I want to find out which columns of df1 are missing from df2.
I know there's EXCEPT in SQL, but how can I do this with pandas/Python? Any suggestions?
There's a function meant just for this purpose: pd.Index.difference
df1.columns.difference(df2.columns)
Index(['month', 'year'], dtype='object')
And the corresponding columns:
df1[df1.columns.difference(df2.columns)]
month year
0 1 2010
1 1 2012
2 11 2014
3 11 2014
4 9 2016
You can do:
[col for col in df1.columns if col not in df2.columns]
This finds the columns of df1 that are not in df2 and returns them as a list of column names.
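The two approaches can differ in ordering: Index.difference sorts its result by default, while the list comprehension keeps df1's column order. A small sketch with made-up frames (the column order in `df1` is chosen here only to make the difference visible):

```python
import pandas as pd

# Illustrative frames: 'year' deliberately comes before 'month' in df1.
df1 = pd.DataFrame({'row_num': [1, 2], 'year': [2010, 2012], 'month': [1, 1]})
df2 = pd.DataFrame({'row_num': [1, 2]})

# Index.difference returns the missing labels sorted by default.
sorted_missing = list(df1.columns.difference(df2.columns))

# The list comprehension preserves the order the columns have in df1.
ordered_missing = [col for col in df1.columns if col not in df2.columns]
```

Which one to prefer depends on whether the original column order matters for the later step.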