Merging two datasets with python - python

First dataset: dim(d) = (70856886, 12); second dataset: dim(e) = (354, 6).
Both datasets have a common variable, subject, and I want to merge them on that variable. I used this code in Python:
# Merging both dataset:
data=pd.merge(d, e, on='subject')
When I do that I lose some data: the row count of my new merged dataset is 62611728.
My question is: why am I losing those observations? [70856886 - 62611728 = 8245158]

As the documentation states, pd.merge() "Merge[s] DataFrame or named Series objects with a database-style join."
In general, it's a good idea to try something on a small dataset to see if you understand its function correctly and then to apply it to a large dataset.
Here's an example for pd.merge():
import pandas as pd
df1 = pd.DataFrame([
    {'subject': 'a', 'value': 1},
    {'subject': 'a', 'value': 2},
    {'subject': 'b', 'value': 3},
    {'subject': 'c', 'value': 4},
    {'subject': 'c', 'value': 5},
])
df2 = pd.DataFrame([
    {'subject': 'a', 'other': 6},
    {'subject': 'b', 'other': 7},
    {'subject': 'b', 'other': 8},
    {'subject': 'd', 'other': 9}
])
df = pd.merge(df1, df2, on='subject')
print(df)
What output do you expect? It should be this:
  subject  value  other
0       a      1      6
1       a      2      6
2       b      3      7
3       b      3      8
In your case, we can only assume that only 62611728 records could be constructed with a matching 'subject' - the rest of the records in either d or e had subjects with no match in the other.
The result combines the values from both dataframes, but only for rows that share a 'subject' value. Any non-matching 'subject' rows are left out, on either side, because the default is an 'inner' join.
Look at the documentation for the other variants. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
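If you want to see exactly which rows are being dropped, you can check it directly. Here is a sketch (assuming the d and e from the question): a left join keeps every row of d, and indicator=True adds a _merge column flagging rows that found no match in e.
check = pd.merge(d, e, on='subject', how='left', indicator=True)
print(check['_merge'].value_counts())  # counts of 'both' vs. 'left_only'

# Or list the subjects in d that never appear in e:
missing_subjects = set(d['subject']) - set(e['subject'])
print(len(missing_subjects))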

Related

Trying to combine multiple lists of dictionaries into single Pandas DataFrame

I am using a webservice that returns inference data about submitted images in the form of:
{'IMG_123.jpg' : [{'keyword': value, 'score': value}, {'keyword': value, 'score': value}, {'keyword': value, 'score': value}]}
Like this:
https://i.stack.imgur.com/FEDqU.png
I want to combine multiple queries into a single dataframe such that the columns are the names of the images, the index holds the "keyword" values, and the data points are the "score" values.
I have been able to transform the data into, I think, a more usable format using this code:
d = {}
for k, v in mydict.items():
    d[k] = [{i['keyword']: i['score']} for i in v]
print(pd.DataFrame(d['IMG_1221.JPG']).T)
But this returns: https://i.stack.imgur.com/c3R0l.png
I am not sure how to combine multiple images into the format I am looking for, and the above code does not format my columns in a useful way.
The service returns keyword values that are not consistent across all images, so the returned lists of dicts differ in size and keys. I would like to have a NaN or 0 value for any key that does not exist for a given image but does exist for other images in the dataframe.
Any help is much appreciated!
IIUC, you want something like this:
import pandas as pd
mydict = {'IMG_1.JPG': [
    {'keyword': 'a', 'score': 1},
    {'keyword': 'b', 'score': 2},
    {'keyword': 'c', 'score': 3}]}
mydict2 = {'IMG_2.JPG': [
    {'keyword': 'a', 'score': 1},
    {'keyword': 'b', 'score': 2},
    {'keyword': 'd', 'score': 3}]}

mydicts = [mydict, mydict2]
df_all = pd.DataFrame()
for d in mydicts:
    key = list(d.keys())[0]
    df = pd.DataFrame(d[key]).set_index('keyword').rename(columns={'score': key})
    df_all = pd.concat([df_all, df], axis=1)
print(df_all)
         IMG_1.JPG  IMG_2.JPG
keyword
a              1.0        1.0
b              2.0        2.0
c              3.0        NaN
d              NaN        3.0
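A more compact variant (just a sketch, assuming the same input structure) builds one Series per image and lets the DataFrame constructor align the keywords, filling the gaps with NaN:
combined = {}
for d in mydicts:
    for img, records in d.items():
        # keyword -> score mapping for this image
        combined[img] = pd.Series({r['keyword']: r['score'] for r in records})
df_all = pd.DataFrame(combined)  # missing keywords become NaN automatically
print(df_all)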

Extract only cells containing a string from pandas table and copy them into a new table

I have a huge pandas table, with many rows and columns. I want to pull all the cells that contain a specific string and create a new table containing only those. Any ideas on how to approach this?
Thank you!
Do you mean something like this?
import pandas as pd
df1 = pd.DataFrame([
    {'a': 'sky is blue', 'b': 7},
    {'a': 'fire is red', 'b': 9},
    {'a': 'water is blue', 'b': 8},
])
df2 = df1.loc[df1.a.str.contains('blue'), :]
# df2 is now:
#
# a b
# 0 sky is blue 7
# 2 water is blue 8
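The example above only searches column a. Since the question mentions many columns, here is a sketch (assuming every cell can be cast to a string) that keeps any row where at least one cell contains the substring:
# Boolean mask per cell: does the cell's string form contain 'blue'?
mask = df1.astype(str).apply(lambda col: col.str.contains('blue'))
df3 = df1.loc[mask.any(axis=1)]  # keep rows where any cell matched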

Slicing out unique rows from a pandas DataFrame to store in separate DataFrame

SOLVED:
# Split and save all unique parts to separate CSV
for unique_part in df['Part'].unique():
    df.loc[df['Part'] == unique_part].to_csv(f'Part_{unique_part}.csv')
I have a table containing production data on parts and the variables that were recorded during their production. I need to slice out all columns for each unique part, i.e., all columns for parts #1, #2, and #3 should be sliced out and put into separate dataframes.
FORMAT:
Part | Variable1 | Variable 2 etc
1-----------X---------------X
1-----------X---------------X
2-----------X---------------X
2-----------X---------------X
2-----------X---------------X
2-----------X---------------X
2-----------X---------------X
2-----------X---------------X
2-----------X---------------X
3-----------X---------------X
3-----------X---------------X
3-----------X---------------X
I have already tried
Creating a dictionary to group by
dict = {k: v for k, v in df.groupby('Part')}
This didn't work because I couldn't properly convert from the dict back to a DataFrame with the correct format.
I also tried creating a variable to store all unique part numbers; I just don't know how to loop through the main dataframe to slice out each unique part's rows:
part_num = df['Part'].unique()
In summary, I need to create separate dataframes with all variable columns for each cluster of rows with unique part number ids.
You can groupby and then apply to turn each group into a list of dicts, and then turn the groupby into a dict where each key is the unique Part value.
Something like:
import pandas as pd

df = pd.DataFrame({
    'Part': [1, 1, 1, 3, 3, 2, 2, 2],
    'other': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
})
d = df.groupby('Part').apply(lambda g: g.to_dict('records')).to_dict()
print(d)
will print
{1: [{'Part': 1, 'other': 'a'},
     {'Part': 1, 'other': 'b'},
     {'Part': 1, 'other': 'c'}],
 2: [{'Part': 2, 'other': 'f'},
     {'Part': 2, 'other': 'g'},
     {'Part': 2, 'other': 'h'}],
 3: [{'Part': 3, 'other': 'd'}, {'Part': 3, 'other': 'e'}]}
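If you then want an actual DataFrame per part rather than a list of dicts, each value converts straight back (a usage sketch based on the d built above):
part_1 = pd.DataFrame(d[1])  # DataFrame containing only the Part 1 rows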
I think you are on the right track with groupby.
df = pd.DataFrame({"Part": [1, 1, 2, 2],
"Var1": [10, 11, 12, 13],
"Var2": [20, 21, 22, 23]})
dfg = df.groupby("Part")
df1 = dfg.get_group(1)
df2 = dfg.get_group(2)
What do you want to DO with the data? Do you really need to create a bunch of individual data frames? The example below loops through each group (each part #) and prints. You could use the same method to do something or get something from each group without creating individual data frames.
for grp in dfg.groups:
    print(dfg.get_group(grp))
    print()
Output:
   Part  Var1  Var2
0     1    10    20
1     1    11    21
   Part  Var1  Var2
2     2    12    22
3     2    13    23
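If you really do want one dataframe per part without hard-coding get_group calls, a dict comprehension over the groupby gives you that directly; the values are already DataFrames, so no further conversion is needed (a sketch reusing dfg from above):
frames = {part: grp for part, grp in dfg}  # {part number: DataFrame of that part's rows}
print(frames[1])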

Find Excel/CSV File's Actual Column References - Python Pandas

Let's say I have an Excel sheet like this:
If I read this file in pandas, I can get Column1, Column2, Column3 as headers.
However, I want to know/create an output possibly a dictionary that is like this,
{Column1: 'A', Column2: 'B', Column3: 'C'}
The reason is that I have another dictionary, from a master mapping file (where the references for each column were already mapped manually), that has all the references for each column, like this:
{Column1: 'A', Column2: 'B', Column3: 'C', Column4: 'D'}
This way, I can cross-check keys and values and identify any mismatches. How can I get the original column reference, such as A for Column1, etc., while reading a file into pandas? Any ideas?
You can use dict with zip to map column names to letters. Assumes you have a maximum of 26 columns.
import numpy as np
import pandas as pd
from string import ascii_uppercase

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  columns=['Column1', 'Column2', 'Column3'])
d = dict(zip(df.columns, ascii_uppercase))
print(d)
{'Column1': 'A', 'Column2': 'B', 'Column3': 'C'}
For more than 26 columns, you can adapt the itertools.product solution available in Repeating letters like excel columns?
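For reference, here is a sketch of such a generator (A..Z, then AA, AB, ...), which you could zip against df.columns in the same way as above:
from itertools import count, product

def excel_columns():
    # Yield Excel-style column references: 'A'..'Z', 'AA', 'AB', ...
    for size in count(1):
        for letters in product(ascii_uppercase, repeat=size):
            yield ''.join(letters)

d = dict(zip(df.columns, excel_columns()))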
You can use the pandas rename method to replace the dataframe column names using your existing mapping dictionary:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html
import pandas as pd
df = pd.DataFrame({'Column1': [1, 2], 'Column2': [3, 4], 'Column3': [5, 6]})
existing_mapping = {'Column1': 'A', 'Column2': 'B', 'Column3': 'C', 'Column4': 'D'}
df = df.rename(columns=existing_mapping)
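If the goal is just to spot mismatches against the master mapping, a small comparison like this might do (a sketch; both dictionaries here are illustrative):
derived = {'Column1': 'A', 'Column2': 'B', 'Column3': 'C'}                 # built from the file you read
master = {'Column1': 'A', 'Column2': 'B', 'Column3': 'C', 'Column4': 'D'}  # from the master mapping file

missing = master.keys() - derived.keys()   # columns in the master file but not in this sheet
mismatched = {col: (derived[col], master[col])
              for col in derived.keys() & master.keys()
              if derived[col] != master[col]}
print(missing)      # {'Column4'}
print(mismatched)   # {}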

pandas dataframe: convert values into an array of objects

I want to convert the below pandas data frame
data = pd.DataFrame([[1, 2], [5, 6]], columns=['10+', '20+'], index=['A', 'B'])
data.index.name = 'City'
data.columns.name = 'Age Group'
print(data)
Age Group  10+  20+
City
A            1    2
B            5    6
into an array of dictionaries, like
[
{'Age Group': '10+', 'City': 'A', 'count': 1},
{'Age Group': '20+', 'City': 'A', 'count': 2},
{'Age Group': '10+', 'City': 'B', 'count': 5},
{'Age Group': '20+', 'City': 'B', 'count': 6}
]
I am able to get the above expected result using the following loops
result = []
cols_name = data.columns.name
index_names = data.index.name
for index in data.index:
    for col in data.columns:
        result.append({cols_name: col, index_names: index, 'count': data.loc[index, col]})
Is there a better way of doing this? Since my original data will have a large number of records, using for loops will take more time.
I think you can use stack with reset_index to reshape, and finally to_dict:
print(data.stack().reset_index(name='count'))
  City Age Group  count
0    A       10+      1
1    A       20+      2
2    B       10+      5
3    B       20+      6
print(data.stack().reset_index(name='count').to_dict(orient='records'))
[
{'Age Group': '10+', 'City': 'A', 'count': 1},
{'Age Group': '20+', 'City': 'A', 'count': 2},
{'Age Group': '10+', 'City': 'B', 'count': 5},
{'Age Group': '20+', 'City': 'B', 'count': 6}
]
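An alternative reshape (a sketch; note that the row order differs because melt walks the columns first) uses melt on the reset index:
records = (data.reset_index()
               .melt(id_vars='City', var_name='Age Group', value_name='count')
               .to_dict(orient='records'))
print(records)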
