I have two pandas dataframes. One has 7000 rows and the other has 7003. They should both have the same column (a column with the names of cities), so one dataframe is missing 3 cities.
I need to discover which cities are missing from my df: I want to compare my two dataframes and find which rows are missing from the other one.
How could I do that? How could I write code that gives me the exact missing rows (the names of the cities) in my df, in comparison to the other?
df1
+-------+--------------+
| id | cities |
+-------+--------------+
| 1 | London |
| 2 | New York |
| 3 | Rio de Jan. |
| 4 | Roma |
| 5 | Berlin |
| 6 | Paris |
| 7 | Tokio |
+-------+--------------+
df2
+-------+--------------+
| id | cities |
+-------+--------------+
| 1 | London |
| 2 | New York |
| 3 | Rio de Jan. |
| 4 | Roma |
| 5 | Berlin |
| 6 | Paris |
+-------+--------------+
One approach using set:
missing_cities = set(df1["cities"]) - set(df2["cities"])
print(missing_cities)
Output
{'Tokio'}
As an alternative, use difference:
missing_cities = set(df1["cities"]).difference(df2["cities"])
The time complexity of both approaches is O(n + m), where n and m are the lengths of the two columns.
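For a fully reproducible check, here is a minimal sketch that rebuilds the sample frames above; it also uses symmetric_difference, in case rows could be missing from either side:
import pandas as pd

df1 = pd.DataFrame({"id": range(1, 8),
                    "cities": ["London", "New York", "Rio de Jan.",
                               "Roma", "Berlin", "Paris", "Tokio"]})
df2 = df1.iloc[:6]  # same data, minus the last row

print(set(df1["cities"]) - set(df2["cities"]))                 # {'Tokio'}
# symmetric_difference also catches rows that exist only in df2
print(set(df1["cities"]).symmetric_difference(df2["cities"]))  # {'Tokio'}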
Another method is to use concat and .duplicated(keep=False) with a boolean filter.
When using .concat you can pass the optional keys argument, which lets you tell via the index which dataframe each row came from.
dfc = pd.concat([df1, df2], keys=[1, 2])
dfc[~dfc.duplicated(subset='cities', keep=False)]
id cities
1 6 7 Tokio
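If you want just the missing names, and which input frame they came from, a small follow-up sketch using the keys level of the index:
missing = dfc[~dfc.duplicated(subset='cities', keep=False)]
print(missing['cities'].tolist())          # ['Tokio']
print(missing.index.get_level_values(0))   # [1] -> the row came from df1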
All the column data is going into the "index" column.
The header starts at row number 7.
'''
index mfg legalId resellerName resellerCountry
('SONICWALL', ' ', 'HEXAPAGE', 'FRANCE')
('SONICWALL', ' ', 'SEXTANT BTS LLC', 'UNITED STATES')
('SONICWALL', ' ', 'New Vision Networks, Inc.', 'UNITED STATES')
'''
All the values are inside the index column; I want those values to come under the specified columns respectively.
The specified columns are
mfg, legalId, resellerName, resellerCountry
Below is the code I have written; please help me with how to do this:
import csv
import pandas as pd

# sep and delimiter are aliases in read_csv, so only one should be passed
df2 = pd.read_csv(data, header=6, keep_default_na=False, sep=',', quoting=csv.QUOTE_MINIMAL)
If your .csv file already has the columns as its first row, then remove the header=6 argument and let pandas infer the header, which is the default.
If you keep the "index" column in the csv file, then with header='infer' the dataframe will look like the table below, which is not aligned with the data: the values are shifted left because the rows do not actually contain index values.
+----+-----------+-------+--------------------------+----------------+-------------------+
| | index | mfg | legalId | resellerName | resellerCountry |
+====+===========+=======+==========================+================+===================+
| 0 | SONICWALL | | HEXAPAGE | FRANCE | |
+----+-----------+-------+--------------------------+----------------+-------------------+
| 1 | SONICWALL | | SEXTANT BTS LLC | UNITED STATES | |
+----+-----------+-------+--------------------------+----------------+-------------------+
| 2 | SONICWALL | | New Vision Networks Inc. | UNITED STATES | |
+----+-----------+-------+--------------------------+----------------+-------------------+
You can remove the "index" column from the .csv file and reset the index on the dataframe with:
df2.reset_index(level=0, inplace=True)
and the data will be:
+----+---------+-----------+-----------+--------------------------+-------------------+
| | index | mfg | legalId | resellerName | resellerCountry |
+====+=========+===========+===========+==========================+===================+
| 0 | 0 | SONICWALL | | HEXAPAGE | FRANCE |
+----+---------+-----------+-----------+--------------------------+-------------------+
| 1 | 1 | SONICWALL | | SEXTANT BTS LLC | UNITED STATES |
+----+---------+-----------+-----------+--------------------------+-------------------+
| 2 | 2 | SONICWALL | | New Vision Networks Inc. | UNITED STATES |
+----+---------+-----------+-----------+--------------------------+-------------------+
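To illustrate the fix end to end, here is a minimal, self-contained sketch; the inline CSV is hypothetical but mirrors the file described above, with the "index" name dropped from the header so the four fields line up with the four column names:
import io
import pandas as pd

raw = io.StringIO(
    "mfg,legalId,resellerName,resellerCountry\n"
    "SONICWALL, ,HEXAPAGE,FRANCE\n"
    "SONICWALL, ,SEXTANT BTS LLC,UNITED STATES\n"
)
df2 = pd.read_csv(raw, keep_default_na=False)
df2.reset_index(level=0, inplace=True)  # promote the RangeIndex to an "index" column
print(df2)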
I am trying to output integer values (labels/classes) in a new column, based on the labels of another column in my dataset. I actually did it by creating a new column (with a numerical column heading) of boolean values for each class, and then using these to create the new class column with numerical values. But I was trying to do it with a dictionary, which I think is a better and faster way.
If I run a code like this:
x = df['Item_Type'].value_counts()
item_type_mapping = {}
item_list = x.index
for i in range(len(item_list)):
    item_type_mapping[item_list[i]] = i
It generates the dictionary, but then if I run:
df['Item_Type']=df['Item_Type'].map(lambda x:item_type_mapping[x])
or
df['New_column']=[item_type_mapping[item] for item in data.Item_Type]
it raises KeyError: None.
Does anybody know why this occurs? It seems strange to me, since the dictionary has been created and I can see it in my variables.
Thanks
Edit 1
@Fourier
Put simply, I have this column:
| Item_type|
| -------- |
| Nino |
| Nino |
| Nino |
| Pasquale |
| Franco |
| Franco |
and then I need the same column or a new one to display:
| Item_type| New_column |
| -------- | ---------- |
| Nino | 1 |
| Nino | 1 |
| Nino | 1 |
| Pasquale | 2 |
| Franco | 3 |
| Franco | 3 |
Your code works for me, but what you're trying to do is already provided by pandas as Categorical data.
df = pd.DataFrame({'Item_Type': list('abca')})
df['New_column'] = df.Item_Type.astype('category').cat.codes
Result:
Item_Type New_column
0 a 0
1 b 1
2 c 2
3 a 0
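As a side note, pd.factorize produces the same kind of integer codes, plus the array of unique labels, without building a dictionary by hand; a quick sketch:
codes, uniques = pd.factorize(df['Item_Type'])
df['New_column'] = codes
print(uniques)  # the label behind each integer code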
I have an Excel file that contains a country name and dates as column names.
+---------+------------+------------+------------+
| country | 20/01/2020 | 21/01/2020 | 22/01/2020 |
+---------+------------+------------+------------+
| us      | 0          | 5          | 6          |
+---------+------------+------------+------------+
| Italy   | 20         | 23         | 33         |
+---------+------------+------------+------------+
| India   | 0          | 0          | 6          |
+---------+------------+------------+------------+
But I need the columns arranged as country, date, and count. Is there any way to rearrange the Excel data without copying and pasting?
The final Excel sheet needs to look like this:
+---------+------------+-------+
| country | date       | count |
+---------+------------+-------+
| us      | 20/01/2020 | 0     |
+---------+------------+-------+
| us      | 21/01/2020 | 5     |
+---------+------------+-------+
| us      | 22/01/2020 | 6     |
+---------+------------+-------+
| Italy   | 20/01/2020 | 20    |
+---------+------------+-------+
| Italy   | 21/01/2020 | 23    |
+---------+------------+-------+
| Italy   | 22/01/2020 | 33    |
+---------+------------+-------+
| India   | 20/01/2020 | 0     |
+---------+------------+-------+
Unpivot using Power Query:
Data --> Get & Transform --> From Table/Range
Select the country column
Unpivot Other Columns
Rename the resulting Attribute and Value columns to date and count
Because the dates in the header are turned into text, you may need to change the date column type back to date or, as I did, to date using locale.
M-Code
let
    Source = Excel.CurrentWorkbook(){[Name="Table2"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"country", type text}, {"20/01/2020", Int64.Type}, {"21/01/2020", Int64.Type}, {"22/01/2020", Int64.Type}}),
    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Changed Type", {"country"}, "date", "count"),
    #"Changed Type with Locale" = Table.TransformColumnTypes(#"Unpivoted Other Columns", {{"date", type date}}, "en-150")
in
    #"Changed Type with Locale"
Power Query is the best way, but if you want to use formulas:
In F1 enter:
=INDEX($A$2:$A$4,ROUNDUP(ROWS($1:1)/3,0))
and copy downward. In G1 enter:
=INDEX($B$1:$D$1,MOD(ROWS($1:1)-1,3)+1)
and copy downward. In H1 enter:
=INDEX($B$2:$D$4,ROUNDUP(ROWS($1:1)/3,0),MOD(ROWS($1:1)-1,3)+1)
and copy downward.
The 3 in these formulas is because we have 3 dates in the original table.
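Since the rest of this thread is pandas-centric, for completeness here is a minimal pandas sketch of the same unpivot; the file name is hypothetical, and it assumes country is the first column with the dates as the remaining headers:
import pandas as pd

wide = pd.read_excel("countries.xlsx")  # hypothetical file name
long = wide.melt(id_vars="country", var_name="date", value_name="count")
long["date"] = pd.to_datetime(long["date"], dayfirst=True)  # the date headers may arrive as text
long.to_excel("countries_long.xlsx", index=False)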
Let's say I have a table with 3 fields: client, city, sales, with sales being a float.
+--------+--------+-------+
| client | city | sales |
+--------+--------+-------+
| a | NY | 0 |
| a | LA | 1 |
| a | London | 2 |
| b | NY | 3 |
| b | LA | 4 |
| b | London | 5 |
+--------+--------+-------+
For each client, I would like to show what is the city with the greatest sales, and what those sales are, ie I want this output:
+--------+--------+-------+
| client | city | sales |
+--------+--------+-------+
| a | London | 2 |
| b | London | 5 |
+--------+--------+-------+
Any suggestions?
This table can be generated with:
df=pd.DataFrame()
df['client']= np.repeat( ['a','b'],3 )
df['city'] = np.tile( ['NY','LA','London'],2)
df['sales']= np.arange(0,6)
This is wrong because it also calculates the 'maximum' of the city, and it shows NY because it considers that 'N' > 'L':
max_by_id = df.groupby('client').max()
I can first create a dataframe with the highest sales, and then merge it with the initial dataframe to retrieve the city; it works, but I was wondering if there is a faster / more elegant way?
out = pd.merge( df, max_by_id, how='inner' ,on=['client','sales'] )
I remember doing something similar with cross apply statements in SQL but wouldn't know how to run a Pandas equivalent.
You need to sort by sales, then group by client and pick the first row:
df.sort_values(['sales'], ascending=False).groupby('client').first().reset_index()
Or, as @user3483203 suggested:
df.loc[df.groupby('client')['sales'].idxmax()]
Output:
client city sales
0 a London 2
1 b London 5
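A closely related one-liner is to sort and then drop duplicate clients, which keeps the city column without needing a merge:
df.sort_values('sales', ascending=False).drop_duplicates('client')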
Let's say I have three dataframes as follows, and I would like to find out in which dataframes a particular record exists.
This is dataframe1 (df1):
index | name | acct_no | country
2 | alex | 112233 | USA
3 | rider | 223344 | Mexico
This is dataframe2 (df2):
index | name | acct_no | country
2 | alex | 112233 | USA
3 | keith | 993344 | Brazil
This is dataframe3 (df3):
index | name | acct_no | country
2 | alex | 112233 | USA
3 | hopper | 444444 | Canada
So, if I run the following code, I can find all the information about acct_no 112233 in a single dataframe.
p = df1.loc[df1['acct_no'] == 112233]
But I would like to know what code will help me find out whether acct_no 112233 exists in df1, df2, and df3.
One way to know whether the element is in the 'acct_no' column of a dataframe is:
>>> (df1['acct_no'] == 112233).any()
True
You could check all of them at the same time by doing:
>>> all([(df['acct_no'] == 112233).any() for df in [df1, df2, df3]])
True
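If you also want to know which of the frames contain the account, rather than just whether all of them do, a small sketch over named frames:
frames = {'df1': df1, 'df2': df2, 'df3': df3}
found_in = [name for name, frame in frames.items()
            if (frame['acct_no'] == 112233).any()]
print(found_in)  # e.g. ['df1', 'df2', 'df3']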