Nesting dataframe using pyspark - python

I am new to PySpark and I am trying to hold data for multiple countries in a single row. I don't know in advance how many country fields I will get, so I want each row to contain multiple country name/capital pairs, following the schema below. Is it possible to do this with PySpark?
schema = StructType([
    StructField('id', LongType()),
    StructField('country', StructType([
        StructField('name', StringType()),
        StructField('capital', StringType())
    ])),
    StructField('review', StringType())
])

data = [
    (1, [('Japan', 'Tokyo'), ('France', 'Paris'), ('UK', 'London')], 'nice'),
    (2, [('Japan', 'Tokyo'), ('France', 'Paris'), ('UK', 'London'),
         ('US', 'Washington')], 'not good')
]
I am dealing with hierarchical data: I want all the countries and capitals present in the list for id = 1 to end up in a single row for id = 1. Converting the tuples into separate lists of countries and capitals is not an option, because the number of tuples differs for every record.
Expected dataframe -
+----+---------+------------+----------+
| id | name | capital | review |
+----+---------+------------+----------+
| 1 | Japan | Tokyo | Nice |
| | France | Paris | |
| | UK | London | |
+----+---------+------------+----------+
| 2 | Japan | Tokyo | Not Good |
| | France | Paris | |
| | UK | London | |
| | US | Washington | |
+----+---------+------------+----------+
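A possible sketch: model the variable-length country data as an ArrayType of structs (an assumption; the schema above uses a single struct) and flatten it back out with explode, which yields one row per (name, capital) pair with the id repeated:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               LongType, ArrayType)

spark = SparkSession.builder.getOrCreate()

# Assumption: 'country' is an array of structs so each id can hold
# any number of (name, capital) pairs
schema = StructType([
    StructField('id', LongType()),
    StructField('country', ArrayType(StructType([
        StructField('name', StringType()),
        StructField('capital', StringType())
    ]))),
    StructField('review', StringType())
])

data = [
    (1, [('Japan', 'Tokyo'), ('France', 'Paris'), ('UK', 'London')], 'nice'),
    (2, [('Japan', 'Tokyo'), ('France', 'Paris'), ('UK', 'London'),
         ('US', 'Washington')], 'not good'),
]

df = spark.createDataFrame(data, schema)

# explode() emits one row per struct in the array
flat = (df
        .select('id', F.explode('country').alias('c'), 'review')
        .select('id', 'c.name', 'c.capital', 'review'))
flat.show()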

Related

Comparing two df to discover the missing rows

I have two pandas dataframes. One has 7000 rows, the other has 7003. Technically they should both have the same column (a column with names of cities), so one dataframe is missing 3 cities.
I need to discover which cities are missing from my df. How can I compare the two dataframes and find the exact rows (the names of the cities) that are present in one but missing from the other?
df1
+-------+--------------+
| id | cities |
+-------+--------------+
| 1 | London |
| 2 | New York |
| 3 | Rio de Jan. |
| 4 | Roma |
| 5 | Berlin |
| 6 | Paris |
| 7 | Tokio |
+-------+--------------+
df2
+-------+--------------+
| id | cities |
+-------+--------------+
| 1 | London |
| 2 | New York |
| 3 | Rio de Jan. |
| 4 | Roma |
| 5 | Berlin |
| 6 | Paris |
+-------+--------------+
One approach using set:
missing_cities = set(df1["cities"]) - set(df2["cities"])
print(missing_cities)
Output
{'Tokio'}
As an alternative, use difference:
missing_cities = set(df1["cities"]).difference(df2["cities"])
The time complexity of both approaches is O(n + m), where n and m are the lengths of the two columns.
Another method is to use concat and .duplicated(keep=False) with a boolean filter.
When using pd.concat you can pass the optional keys argument, which lets you tell the two dataframes apart via the index.
import pandas as pd

dfc = pd.concat([df1, df2], keys=[1, 2])
dfc[~dfc.duplicated(subset='cities', keep=False)]
     id cities
1 6   7  Tokio
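If you would rather keep the result as dataframe rows instead of a set, a small sketch using isin (the toy frames below mirror the tables above):

import pandas as pd

# Toy frames mirroring the example tables
df1 = pd.DataFrame({'id': range(1, 8),
                    'cities': ['London', 'New York', 'Rio de Jan.',
                               'Roma', 'Berlin', 'Paris', 'Tokio']})
df2 = df1.iloc[:6]

# Keep the rows of df1 whose city never appears in df2
missing = df1[~df1['cities'].isin(df2['cities'])]
print(missing)  # id=7, cities=Tokio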

Reading file with pandas read csv is not working

All the column data is going into the "index" column. The header starts at row number 7.
'''
index mfg legalId resellerName resellerCountry
(SONICWALL',' ','HEXAPAGE','FRANCE')
(SONICWALL',' ','SEXTANT BTS LLC','UNITED STATES')
(SONICWALL',' ','New Vision Networks, Inc.','UNITED STATES')
'''
All the values end up inside the index column; I want those values to fall under their respective columns.
The specified columns are:
mfg, legalId, resellerName, resellerCountry
Below is the code I have written; how can I make this work?
df2 = pd.read_csv(data, header=6, keep_default_na=False, sep=',', quoting=csv.QUOTE_MINIMAL)
If your .csv file already has the column names as its first row, then remove the header=6 argument and let pandas infer the header, which is the default.
If you keep the "index" column in the csv file, then with header='infer' the dataframe will look like the table below, which is not aligned with the data: the data is shifted to the left because the rows do not actually contain index values.
+----+-----------+-------+--------------------------+----------------+-------------------+
| | index | mfg | legalId | resellerName | resellerCountry |
+====+===========+=======+==========================+================+===================+
| 0 | SONICWALL | | HEXAPAGE | FRANCE | |
+----+-----------+-------+--------------------------+----------------+-------------------+
| 1 | SONICWALL | | SEXTANT BTS LLC | UNITED STATES | |
+----+-----------+-------+--------------------------+----------------+-------------------+
| 2 | SONICWALL | | New Vision Networks Inc. | UNITED STATES | |
+----+-----------+-------+--------------------------+----------------+-------------------+
You can remove the "index" column from the .csv file and call reset_index on the dataframe:
df2.reset_index(level=0, inplace=True)
and the data will be:
+----+---------+-----------+-----------+--------------------------+-------------------+
| | index | mfg | legalId | resellerName | resellerCountry |
+====+=========+===========+===========+==========================+===================+
| 0 | 0 | SONICWALL | | HEXAPAGE | FRANCE |
+----+---------+-----------+-----------+--------------------------+-------------------+
| 1 | 1 | SONICWALL | | SEXTANT BTS LLC | UNITED STATES |
+----+---------+-----------+-----------+--------------------------+-------------------+
| 2 | 2 | SONICWALL | | New Vision Networks Inc. | UNITED STATES |
+----+---------+-----------+-----------+--------------------------+-------------------+
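Another sketch around the misalignment, assuming the header row simply carries one extra "index" name that the data rows never fill: skip the original header entirely and supply the real column names yourself (the file path is a placeholder):

import pandas as pd

cols = ['mfg', 'legalId', 'resellerName', 'resellerCountry']
# header=6 meant the 7th line held the header, so skiprows=7 drops
# everything up to and including that old header line
df2 = pd.read_csv('data.csv', skiprows=7, names=cols,
                  keep_default_na=False)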

Is there any way to rearrange excel data without copy paste?

I have an Excel file that contains country names and dates as column names.
+---------+------------+------------+------------+
| country | 20/01/2020 | 21/01/2020 | 22/01/2020 |
+---------+------------+------------+------------+
| us      | 0          | 5          | 6          |
+---------+------------+------------+------------+
| Italy   | 20         | 23         | 33         |
+---------+------------+------------+------------+
| India   | 0          | 0          | 6          |
+---------+------------+------------+------------+
But I need the columns arranged as country, date, and count. Is there any way to rearrange the Excel data without copy-pasting?
The final Excel sheet needs to look like this:
+---------+------------+-------+
| country | date       | count |
+---------+------------+-------+
| us      | 20/01/2020 | 0     |
+---------+------------+-------+
| us      | 21/01/2020 | 5     |
+---------+------------+-------+
| us      | 22/01/2020 | 6     |
+---------+------------+-------+
| Italy   | 20/01/2020 | 20    |
+---------+------------+-------+
| Italy   | 21/01/2020 | 23    |
+---------+------------+-------+
| Italy   | 22/01/2020 | 33    |
+---------+------------+-------+
| India   | 20/01/2020 | 0     |
+---------+------------+-------+
Unpivot using Power Query:
Data --> Get & Transform --> From Table/Range
Select the country column
Unpivot Other columns
Rename the resulting Attribute and Value columns to date and count
Because the dates in the header are turned into text, you may need to change the date column's type to Date, or, as I did, to Date using locale.
M-Code
let
    Source = Excel.CurrentWorkbook(){[Name="Table2"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"country", type text}, {"20/01/2020", Int64.Type}, {"21/01/2020", Int64.Type}, {"22/01/2020", Int64.Type}}),
    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Changed Type", {"country"}, "date", "count"),
    #"Changed Type with Locale" = Table.TransformColumnTypes(#"Unpivoted Other Columns", {{"date", type date}}, "en-150")
in
    #"Changed Type with Locale"
Power Pivot is the best way, but if you want to use formulas:
In F1 enter:
=INDEX($A$2:$A$4,ROUNDUP(ROWS($1:1)/3,0))
and copy downward. In G1 enter:
=INDEX($B$1:$D$1,MOD(ROWS($1:1)-1,3)+1)
and copy downward. In H1 enter:
=INDEX($B$2:$D$4,ROUNDUP(ROWS($1:1)/3,0),MOD(ROWS($1:1)-1,3)+1)
and copy downward.
The 3 in these formulas is because we have 3 dates in the original table.
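For completeness, a pandas sketch of the same unpivot, in case reshaping outside Excel is acceptable (the file names are placeholders):

import pandas as pd

# melt() is pandas' unpivot: one (country, date, count) row per cell
wide = pd.read_excel('countries.xlsx')
long = wide.melt(id_vars='country', var_name='date', value_name='count')

# The date headers arrive as text, so parse them (day-first format)
long['date'] = pd.to_datetime(long['date'], dayfirst=True)
long.to_excel('countries_long.xlsx', index=False)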

groupby: how to show max(field1) and the value of field2 corresponding to max(field1)?

Let's say I have a table with 3 fields: client, city, sales, with sales being a float.
+--------+--------+-------+
| client | city | sales |
+--------+--------+-------+
| a | NY | 0 |
| a | LA | 1 |
| a | London | 2 |
| b | NY | 3 |
| b | LA | 4 |
| b | London | 5 |
+--------+--------+-------+
For each client, I would like to show the city with the greatest sales and what those sales are, i.e. I want this output:
+--------+--------+-------+
| client | city | sales |
+--------+--------+-------+
| a | London | 2 |
| b | London | 5 |
+--------+--------+-------+
Any suggestions?
This table can be generated with:
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['client'] = np.repeat(['a', 'b'], 3)
df['city'] = np.tile(['NY', 'LA', 'London'], 2)
df['sales'] = np.arange(0, 6)
This is wrong because it also computes the 'maximum' of the city column, and shows NY because it considers that N > L:
max_by_id = df.groupby('client').max()
I can first create a dataframe with the highest sales and then merge it with the initial dataframe to retrieve the city. It works, but I was wondering whether there is a faster / more elegant way?
out = pd.merge(df, max_by_id.reset_index(), how='inner', on=['client', 'sales'])
I remember doing something similar with cross apply statements in SQL but wouldn't know how to run a Pandas equivalent.
You need to sort by sales, then groupby client and pick the first row:
df.sort_values(['sales'], ascending=False).groupby('client').first().reset_index()
OR
As suggested by #user3483203:
df.loc[df.groupby('client')['sales'].idxmax()]
Output:
  client    city  sales
0      a  London      2
1      b  London      5
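Another common idiom, sketched here: keep the rows whose sales equal their client's maximum via a groupby transform (with ties, this keeps every row that hits the max):

import numpy as np
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({'client': np.repeat(['a', 'b'], 3),
                   'city': np.tile(['NY', 'LA', 'London'], 2),
                   'sales': np.arange(0, 6)})

# transform('max') broadcasts each client's max back onto its rows
out = df[df['sales'] == df.groupby('client')['sales'].transform('max')]
print(out)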

How to overwrite identical column names when performing an "outer" join in Pandas?

I am trying to merge/join two csvs, based on a unique city/country/state column combination, using Pandas. However, when I try to do this with an outer join, I get extra columns; instead I would prefer the "right" side of my join to overwrite the columns of the "left" side. Any suggestions?
Here is my attempt, with an example:
These are my csv's:
My "left" csv file:
| city | country | state | pop | lat | long |
|--------------+---------+-------+----------+---------+---------|
| beijing | cn | 22 | 456 | 456 | 456 |
| buenos aires | ar | 7 | 13076300 | -34.613 | -58.377 |
| mexico city | mx | 9 | 123 | 123 | 123 |
My "right" csv file:
| city | country | state | pop | lat | long |
|-------------+---------+-------+----------+-----------+------------|
| adamsville | us | al | 4400 | 33.60575 | -86.97465 |
| alabaster | us | al | 32707 | 33.219442 | -86.823907 |
| beijing | cn | 22 | 11716620 | 39.907 | 116.397 |
| mexico city | mx | 9 | 12294193 | 19.428 | -99.128 |
and I want this result:
| city | country | state | pop | lat | long |
|--------------+---------+-------+----------+-----------+------------|
| adamsville | us | al | 4400 | 33.60575 | -86.97465 |
| alabaster | us | al | 32707 | 33.219442 | -86.823907 |
| beijing | cn | 22 | 11716620 | 39.907 | 116.397 |
| buenos aires | ar | 7 | 13076300 | -34.613 | -58.377 |
| mexico city | mx | 9 | 12294193 | 19.428 | -99.128 |
Note that mexico city and beijing are considered matches, based on their city, country, and state columns. Also note that on these matching rows, each column from my "left" csv is overwritten by the matching column from my "right" csv.
So here is my attempt using Pandas and dataframes:
left = pd.read_csv('left.csv')
right = pd.read_csv('right.csv')
result = pd.merge(left, right, on=['city', 'country', 'state'], how='outer')
Unfortunately, here is my result:
| city | country | state | pop_x | lat_x | long_x | pop_y | lat_y | long_y |
|--------------+---------+-------+----------+-----------+------------+----------+-----------+------------|
| adamsville | us | al | 4400 | 33.60575 | -86.97465 | 4400 | 33.60575 | -86.97465 |
| alabaster | us | al | 32707 | 33.219442 | -86.823907 | 32707 | 33.219442 | -86.823907 |
| albertville | us | al | | 34.26313 | -86.21066 | | 34.26313 | -86.21066 |
| beijing | cn | 22 | 456 | 456 | 456 | 11716620 | 39.907 | 116.397 |
| buenos aires | ar | 7 | 13076300 | -34.613 | -58.377 | 13076300 | -34.613 | -58.377 |
| mexico city | mx | 9 | 123 | 123 | 123 | 12294193 | 19.428 | -99.128 |
| mumbai | in | 16 | 12691836 | 19.073 | 72.883 | 12691836 | 19.073 | 72.883 |
| shanghai | cn | 23 | 22315474 | 31.222 | 121.458 | 22315474 | 31.222 | 121.458 |
As shown above, the columns that are not being used for the join, and which have the same name, are renamed with a _x suffix for the "left" dataframe and a _y suffix for the "right" dataframe.
Is there a simple way to make the columns from the "right" dataframe to overwrite the columns from the "left" dataframe when matched?
Although there seem to be similar questions already out there, I still can't seem to find an answer. For example, I tried implementing the solution based on this question:
left = pd.read_csv('left.csv')
right = pd.read_csv('right.csv')
left = left.set_index(['city','country','state'])
right = right.set_index(['city','country','state'])
left.update(right)
But update only performs a left join, so the resulting dataframe only keeps the rows of the left dataframe and is missing cities like adamsville and alabaster above.
Since the column names of both dataframes are the same, you could stack them and then do a drop_duplicates or a groupby.
For example:
result = pd.concat([left, right]).reset_index()
result.drop_duplicates(['city','country','state'], keep='first', inplace=True)
or:
df_stacked = pd.concat([left, right]).reset_index()
result = df_stacked.groupby(['city','country','state']).first()
Calling first will take the values from the "left" df over the "right" df because we're stacking the "left" df on top of the "right" df and resetting the index
Using groupby will allow you to perform more complex selects on the aggregated records if you don't want to just take the first or last record.
EDIT:
Just realized you want the "right" df to overwrite the "left" df, in that case...
df_stacked = pd.concat([right, left]).reset_index()
result = df_stacked.groupby(['city','country','state']).first()
This methodology only works if the "left" and "right" dataframes don't contain duplicate records to start.
And for the record, to get to the csv solution in the example above, we can perform the following:
result = result.reset_index()
# sort by descending population and, if populations are equal (or NaN), by ascending city name
result = result.sort_values(['pop', 'city'], ascending=[False, True])
result.drop('index', axis=1, inplace=True)
result.to_csv('result.csv', index=False)
Try:
res = pd.concat([left, right], ignore_index=True)
res = res.drop(res[res['city'].duplicated(keep='last')].index, axis=0)
Try this:
result = pd.concat([left, right]).drop_duplicates(['city'], keep='last')
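One more index-aligned sketch: combine_first keeps the union of rows from both frames and prefers the caller's values, so calling it on right gives the right-overwrites-left behaviour asked for (note it patches cell-wise, so a NaN in right would still be filled from left):

import pandas as pd

keys = ['city', 'country', 'state']
left = pd.read_csv('left.csv').set_index(keys)
right = pd.read_csv('right.csv').set_index(keys)

# Union of both indexes; right's non-null values win on matched rows
result = right.combine_first(left).reset_index()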
