I have run .value_counts() on the same column of two DataFrames and would like to compare the results.
I also tried converting the resulting Series to DataFrames (.to_frame('counts'), as suggested in this thread), but it doesn't help.
first = df1['company'].value_counts()
second = df2['company'].value_counts()
I tried to merge, but I think the main problem is that the company name is not a column; it is the index (?). Is there a way to resolve this, or a different way to get the comparison?
GOAL: The end goal is to be able to see which companies occur more in df2 than in df1, and the value_counts() themselves (or the difference between them).
You might use the ability of collections.Counter to subtract, as follows:
import collections
import pandas as pd
df1 = pd.DataFrame({'company':['A','A','A','B','B','C','Y']})
df2 = pd.DataFrame({'company':['A','B','B','C','C','C','Z']})
c1 = collections.Counter(df1['company'])
c2 = collections.Counter(df2['company'])
c1.subtract(c2)
print(c1)
gives output
Counter({'A': 2, 'Y': 1, 'B': 0, 'Z': -1, 'C': -2})
Explanation: a positive value means more instances are in df1, zero means the counts are equal, and a negative value means more instances are in df2.
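A pandas-only variant of the same idea, as a minimal sketch rather than the one right way: it assumes the sample df1 and df2 from the answer above, and the column labels "df1", "df2" and "diff" are names picked here for illustration. Passing the two value_counts() Series to pd.concat aligns them on the company index, so the company name does not need to be a column.

```python
import pandas as pd

df1 = pd.DataFrame({"company": ["A", "A", "A", "B", "B", "C", "Y"]})
df2 = pd.DataFrame({"company": ["A", "B", "B", "C", "C", "C", "Z"]})

first = df1["company"].value_counts()
second = df2["company"].value_counts()

# Align both Series on the company index; companies missing on one side get 0.
counts = pd.concat([first, second], axis=1, keys=["df1", "df2"]).fillna(0).astype(int)

# Positive diff means the company occurs more often in df2 than in df1.
counts["diff"] = counts["df2"] - counts["df1"]
more_in_df2 = counts[counts["diff"] > 0]
print(more_in_df2)
```

Here more_in_df2 keeps both counts next to the difference, which matches the stated goal of seeing the value_counts() themselves as well as the difference.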
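You can use this code:
df2['x'] = '2'
df1['x'] = '1'
df = pd.concat([df1[['company', 'x']], df2[['company', 'x']]])
# count rows per company and source; aggfunc='size' does not need a values column
df = pd.pivot_table(df, index='company', columns='x', aggfunc='size', fill_value=0).reset_index()
Now filter df for the rows you need, for example the companies whose count in column '2' is larger than in column '1'.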
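The filtering step could look roughly like this. It is a sketch that swaps in pd.crosstab to build the count table (an assumption made here for brevity, not what the answer above uses) and reuses the sample frames from the first answer; the marker column x follows the answer.

```python
import pandas as pd

df1 = pd.DataFrame({"company": ["A", "A", "A", "B", "B", "C", "Y"]})
df2 = pd.DataFrame({"company": ["A", "B", "B", "C", "C", "C", "Z"]})

# Tag each row with its source frame, then stack the two frames.
combined = pd.concat([df1.assign(x="1"), df2.assign(x="2")])

# One row per company, one column per source, cells hold occurrence counts.
counts = pd.crosstab(combined["company"], combined["x"])

# Keep only the companies that occur more often in df2 than in df1.
more_in_df2 = counts[counts["2"] > counts["1"]]
print(more_in_df2)
```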
I would like to apply the loop below where for each index value the unique values of a column called SERIAL_NUMBER will be returned. Essentially I want to confirm that for each index there is a unique serial number.
index_values = df.index.levels
for i in index_values:
    x = df.loc[[i]]
    x["SERIAL_NUMBER"].unique()
The problem, however, is that my dataset has a multi-index and, as you can see below, its levels are stored in a frozen list. I am only interested in the index values that contain a long number. The word "vehicle", which is also an index level, can be removed because it is repeated all over the dataset.
How can I extract these values into a list so I can use them in the loop?
index_values
>>
FrozenList([['0557bf98-c3e0-4955-a23f-2394635ab531', '074705a3-a96a-418c-9bfe-14c37f5c4e6f', '0f47e260-0fa2-40ba-a417-7c00ea74248c', '17342ca2-6246-4150-8080-96d6125cf2b5', '26c6c0d1-0134-4b3a-a149-61dd93afab3b', '7600be43-5d0a-49b3-a1ee-fd107db5822f', 'a07f2b0c-447c-4143-a361-d7ddbffdcc77', 'b929801c-2f32-4a95-bfc4-48a05b48ee01', 'cc912023-0113-42cd-8fe7-4df4005127c2', 'e424bd02-e188-462e-a1a6-2f4ed8fe0a2d'], ['vehicle']])
Without an example it's hard to judge, but I think you need:
df.index.get_level_values(0).unique() # add .tolist() if you want a list
import pandas as pd
df = pd.DataFrame({'A' : [5]*5, 'B' : [6]*5})
df = df.set_index('A',append=True)
df.index.get_level_values(0).unique()
Int64Index([0, 1, 2, 3, 4], dtype='int64')
df.index.get_level_values(1).unique()
Int64Index([5], dtype='int64', name='A')
To drop duplicates from an index level, use the .duplicated() method:
df[~df.index.get_level_values(1).duplicated(keep='first')]
     B
  A
0 5  6
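For the original goal (confirming one serial number per first-level index value), a loop may not be needed at all. The following is a sketch under assumptions: the ids and serial numbers are made up, and only the column name SERIAL_NUMBER and the second-level label 'vehicle' come from the question.

```python
import pandas as pd

# Hypothetical data shaped like the question: (id, 'vehicle') multi-index.
idx = pd.MultiIndex.from_tuples(
    [("id-1", "vehicle"), ("id-1", "vehicle"), ("id-2", "vehicle")]
)
df = pd.DataFrame({"SERIAL_NUMBER": ["S1", "S1", "S2"]}, index=idx)

# First-level values as a plain Python list, usable in a loop if needed.
ids = df.index.get_level_values(0).unique().tolist()

# Number of distinct serial numbers per first-level value, without a loop.
serials_per_id = df.groupby(level=0)["SERIAL_NUMBER"].nunique()
all_unique = (serials_per_id == 1).all()

# Drop the constant 'vehicle' level if it carries no information.
flat = df.droplevel(1)
```

Any id where serials_per_id is greater than 1 has more than one serial number attached to it.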
BACKGROUND:
I have two columns: 'address' and 'raw_data'. The dataset looks like this:
this is just a sample I made up, the original dataset is over 6m rows and in a different language
Problem:
I need to find all the rows where 'address' and 'raw_data' do not match, meaning that some mistake was made when the data was copied from 'address' to 'raw_data'.
I'm fairly new to pandas. My plan is to split the 'raw_data' column by comma, then compare the newly produced columns with the original 'address' column (to see whether the 'address' column contains that info; if not, that means there is a mistake?).
Like I said, I'm new to pandas and this is what I have so far.
import pandas as pd
columns = ['address', 'raw_data']
df=pd.read_csv('address.csv', usecols=columns)
df = pd.concat([df['address'], df['raw_data'].str.split(',', expand=True)], axis=1)
Now the new columns have info like this: "CITY":"ATLANTA". I want the columns to contain just ATLANTA, without the colons and 'CITY', in order to compare the info with the 'address' column.
How should I go about it?
Also, at this point of my pandas learning experience, I do not yet know how to compare two columns. Could someone help a newbie out please? Thanks a lot!
PS: by comparison of two columns I meant to check whether one column has the characters in the second column, not to check whether the two columns are equal. Just want to point that out.
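import numpy as np
import pandas as pd
df = pd.DataFrame([[2, 2], [3, 6], [1, 1]], columns=["col1", "col2"])
# element-wise equality check, stored as a new boolean column
comparison_column = np.where(df["col1"] == df["col2"], True, False)
df["equal"] = comparison_column
   col1  col2  equal
0     2     2   True
1     3     6  False
2     1     1   True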
I will use this data:
import numpy as np
import pandas as pd
j = {"address":"foo","b": "bar"}
j2 = {"address":"foo2","b": "bar2"}
values = [["foo", j], ["bar", j2]]
df = pd.DataFrame(data=values, columns=["address", "raw_data"])
df
address raw_data
0 foo {'address': 'foo', 'b': 'bar'}
1 bar {'address': 'foo2', 'b': 'bar2'}
I will separate columns from raw_data (with .values.tolist()) in another df (df2):
df2 = pd.DataFrame(df['raw_data'].values.tolist())
df2
address b
0 foo bar
1 foo2 bar2
To compare you use:
df.address == df2.address
0 True
1 False
If you need to save this in the original df, you can add a column:
df["result"] = df.address == df2.address
You can separate the fields at the commas by treating each raw_data value as a dict. The apply function maps a custom function over a column; in this case each function looks up one key of the dictionary and extracts its value.
df['address_raw'] = df['raw_data'].apply(lambda x: x['address'])
df['city_raw'] = df['raw_data'].apply(lambda x: x['CITY'])
df['addrline2_raw'] = df['raw_data'].apply(lambda x: x['ADDR_LINE_2'])
df['addrline3_raw'] = df['raw_data'].apply(lambda x: x['ADDR_LINE_3'])
df['utmnorthing_raw'] = df['raw_data'].apply(lambda x: x['UTM_NORTHING'])
These lines create one column for each field in the dict, and then you can compare them like this:
df['address'] == df['address_raw']
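Since the question asks for containment rather than equality, here is a sketch of that check. It assumes that raw_data holds JSON text (hence json.loads) and uses made-up sample rows; the key name CITY comes from the question, while the column name city_in_address is invented here.

```python
import json

import pandas as pd

# Hypothetical sample; the real file is read with pd.read_csv as in the question.
df = pd.DataFrame({
    "address": ["12 MAIN ST ATLANTA", "5 OAK AVE BOSTON"],
    "raw_data": [
        '{"CITY": "ATLANTA", "ADDR_LINE_2": "12 MAIN ST"}',
        '{"CITY": "DENVER", "ADDR_LINE_2": "5 OAK AVE"}',
    ],
})

# Parse each JSON string into a dict, then expand the dicts into columns.
parsed = pd.DataFrame(df["raw_data"].apply(json.loads).tolist(), index=df.index)

# Row-wise substring test: does the address text contain the logged city?
df["city_in_address"] = [
    city in addr for addr, city in zip(df["address"], parsed["CITY"])
]

# Rows where the logged city does not appear in the address.
mismatches = df[~df["city_in_address"]]
```

If raw_data is not valid JSON in the real file, the parsing step would need a different approach; the containment check itself stays the same.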
I'd like to check the difference between two DataFrame columns. I tried using the command:
np.setdiff1d(train.columns, train_1.columns)
which results in an empty array:
array([], dtype=object)
However, the number of columns in the dataframes are different:
len(train.columns), len(train_1.columns) = (51, 56)
which means that the two DataFrames are obviously different.
What is wrong here?
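The results are correct; however, setdiff1d is order dependent. It only returns the elements of the first input array that do not occur in the second array. Here every column of train also occurs in train_1, so the result is empty even though train_1 has extra columns.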
If you do not care which of the dataframes have the unique columns you can use setxor1d. It will return "the unique values that are in only one (not both) of the input arrays", see the documentation.
import numpy
colsA = ['a', 'b', 'c', 'd']
colsB = ['b','c']
c = numpy.setxor1d(colsA, colsB)
This returns an array containing 'a' and 'd'.
If you want to use setdiff1d you need to check for differences both ways:
# columns in train.columns that are not in train_1.columns
c1 = np.setdiff1d(train.columns, train_1.columns)
# columns in train_1.columns that are not in train.columns
c2 = np.setdiff1d(train_1.columns, train.columns)
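The same checks can also be written with pandas Index methods. This is a sketch with two tiny stand-in frames; the column names are invented, and only the variable names train and train_1 come from the question.

```python
import numpy as np
import pandas as pd

# Stand-ins: every column of train also exists in train_1, as in the question.
train = pd.DataFrame(columns=["a", "b", "c"])
train_1 = pd.DataFrame(columns=["a", "b", "c", "d", "e"])

# Reproduces the empty result from the question.
empty = np.setdiff1d(train.columns, train_1.columns)

# One-directional differences using Index.difference.
only_in_train = train.columns.difference(train_1.columns)
only_in_train_1 = train_1.columns.difference(train.columns)

# Columns present in exactly one of the two frames.
either = train.columns.symmetric_difference(train_1.columns)
```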
Use something like this:
data_3 = data1[~data1.isin(data2)]
where data1 and data2 are columns, and data_3 holds the values of data1 that do not occur in data2 (data1 - data2).
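A small worked instance of the isin approach, assuming data1 and data2 are pandas Series (the sample values are made up):

```python
import pandas as pd

data1 = pd.Series(["a", "b", "c", "d"])
data2 = pd.Series(["b", "c"])

# isin marks the values of data1 that also occur in data2; ~ inverts the mask.
data_3 = data1[~data1.isin(data2)]
print(data_3.tolist())
```

Note that this is one-directional as well: values that occur only in data2 are not reported.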
In a pandas dataframe, a function can be used to group its index. I'm looking to define a function that instead is applied to a column.
I'm looking to group by two columns, except I need the second column to be grouped by an arbitrary function, foo:
group_sum = df.groupby(['name', foo])['tickets'].sum()
How would foo be defined to group the second column into two groups, demarcated by whether values are > 0, for example? Or, is an entirely different approach or syntax used?
Groupby can accept any combination of both labels and series/arrays (as long as the array has the same length as your dataframe), so you can map the function to your column and pass it into the groupby, like
df.groupby(['name', df[1].map(foo)])
Alternatively, you might want to add the condition as a new column to your dataframe before you perform the groupby; this has the advantage of giving it a name in the index:
df['>0'] = df[1] > 0
group_sum = df.groupby(['name', '>0'])['tickets'].sum()
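To make the mapped-function variant concrete, here is a sketch in which foo is a hypothetical function that labels values by sign; the column name 'value' and the sample data are assumptions, while 'name' and 'tickets' come from the question.

```python
import pandas as pd


def foo(value):
    """Hypothetical grouping function: label a value by its sign."""
    return "positive" if value > 0 else "non-positive"


df = pd.DataFrame({
    "name": ["x", "x", "x", "y"],
    "value": [-1, 1, 1, 0],
    "tickets": [20, 50, 50, 100],
})

# Group by the 'name' label and by the Series produced by mapping foo.
group_sum = df.groupby(["name", df["value"].map(foo)])["tickets"].sum()
print(group_sum)
```

The mapped Series keeps the name 'value', so that is the label of the second index level in the result.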
Something like this will work:
x.groupby(['name', x['value']>0])['tickets'].sum()
As mentioned above, groupby can accept both labels and Series. This should give you the answer you are looking for. Here is an example:
import numpy as np
import pandas as pd
data = np.array([[1, -1, 20], [1, 1, 50], [1, 1, 50], [2, 0, 100]])
x = pd.DataFrame(data, columns=['name', 'value', 'value2'])
x.groupby(['name', x['value'] > 0])['value2'].sum()
name  value
1     False     20
      True     100
2     False    100
Name: value2, dtype: int64
I have a problem where I produce a pandas dataframe by concatenating along the row axis (stacking vertically).
Each of the constituent dataframes has an autogenerated index (ascending numbers).
After concatenation, my index is screwed up: it counts up to n (where n is the shape[0] of the corresponding dataframe), and restarts at zero at the next dataframe.
I am trying to "re-calculate the index, given the current order", or "re-index" (or so I thought). Turns out that isn't exactly what DataFrame.reindex seems to be doing.
Here is what I tried to do:
train_df = pd.concat(train_class_df_list)
train_df = train_df.reindex(index=[i for i in range(train_df.shape[0])])
It failed with "cannot reindex from a duplicate axis." I don't want to change the order of my data... just need to delete the old index and set up a new one, with the order of rows preserved.
If your index is autogenerated and you don't want to keep it, you can use the ignore_index option.
train_df = pd.concat(train_class_df_list, ignore_index=True)
This will autogenerate a new index for you, and my guess is that this is exactly what you are after.
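A short sketch of the difference, using two made-up frames in place of train_class_df_list:

```python
import pandas as pd

# Hypothetical stand-ins for the constituent dataframes.
parts = [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3, 4, 5]})]

# Default behaviour: each part keeps its own index, so labels repeat.
stacked = pd.concat(parts)

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
renumbered = pd.concat(parts, ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

The repeated labels in stacked are what makes reindex fail with the duplicate-axis error: reindex looks rows up by label, and a label such as 0 is ambiguous there. Row order is unchanged in both results.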
After vertical concatenation, if you get an index of [0, n) followed by [0, m), all you need to do is call reset_index:
train_df.reset_index(drop=True)
(you can do this in place using inplace=True).
>>> import pandas as pd
>>> pd.concat([
...     pd.DataFrame({'a': [1, 2]}),
...     pd.DataFrame({'a': [1, 2]})]).reset_index(drop=True)
   a
0  1
1  2
2  1
3  2
This should work:
train_df.reset_index(inplace=True, drop=True)
Set drop to True to avoid an additional column in your dataframe.
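What the drop flag changes can be seen in a small sketch (the frames are made up for illustration):

```python
import pandas as pd

train_df = pd.concat([pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3, 4]})])

# Without drop, the old labels are kept as a new column named 'index'.
kept = train_df.reset_index()

# With drop=True, the old labels are discarded entirely.
dropped = train_df.reset_index(drop=True)

print(list(kept.columns))
print(list(dropped.columns))
```

Both calls return a new DataFrame and leave train_df untouched; with inplace=True the method modifies train_df and returns None, so its result should not be assigned back.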