Find an Excel/CSV File's Actual Column References - Python Pandas

Let's say I have an Excel sheet whose columns A, B and C have the headers Column1, Column2 and Column3.
If I read this file into pandas, I get Column1, Column2, Column3 as headers.
However, I want to produce an output, possibly a dictionary, like this:
{'Column1': 'A', 'Column2': 'B', 'Column3': 'C'}
The reason is that I have another dictionary from a master mapping file (where the reference for each column was entered manually) that looks like this:
{'Column1': 'A', 'Column2': 'B', 'Column3': 'C', 'Column4': 'D'}
This way, I can cross-check keys and values and identify any mismatches. How can I get the original column reference, such as A for Column1, while reading a file into pandas?

You can use dict with zip to map column names to letters. This assumes that the data starts in column A and that you have at most 26 columns.
from string import ascii_uppercase
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  columns=['Column1', 'Column2', 'Column3'])
# Pair each column name with the letter at the same position
d = dict(zip(df.columns, ascii_uppercase))
print(d)
{'Column1': 'A', 'Column2': 'B', 'Column3': 'C'}
For more than 26 columns, you can adapt the itertools.product solution from the question "Repeating letters like excel columns?".
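As a rough illustration of going past column Z, here is a minimal sketch that uses base-26 arithmetic rather than the itertools.product approach mentioned above. The helper name excel_column_letter is hypothetical, and the sketch assumes the data starts in column A and that the column order in the DataFrame matches the file.

```python
def excel_column_letter(index):
    """Return the Excel-style letter(s) for a zero-based column index."""
    letters = ""
    index += 1  # work with 1-based positions, since Excel letters have no zero digit
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


# Hypothetical header list; in practice this could be list(df.columns)
columns = ["Column1", "Column2", "Column3"]
mapping = {name: excel_column_letter(i) for i, name in enumerate(columns)}
print(mapping)                    # {'Column1': 'A', 'Column2': 'B', 'Column3': 'C'}
print(excel_column_letter(26))    # AA
```

Index 25 gives Z and index 26 rolls over to AA, so the same function covers sheets of any width.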

You can use the pandas rename method to replace the dataframe column names using your existing mapping dictionary. Keys that are not present in the dataframe (such as Column4 here) are ignored:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html
import pandas as pd
df = pd.DataFrame({'Column1': [1, 2], 'Column2': [3, 4], 'Column3': [5, 6]})
existing_mapping = {'Column1': 'A', 'Column2': 'B', 'Column3': 'C', 'Column4': 'D'}
df = df.rename(columns=existing_mapping)
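If the goal is the cross-check described in the question rather than renaming, a comparison along the following lines might help. This is only a sketch: the master_mapping values are made up and deliberately contain a swapped pair, the variable names are hypothetical, and it assumes the sheet starts in column A with at most 26 columns.

```python
from string import ascii_uppercase

import pandas as pd

df = pd.DataFrame({'Column1': [1, 2], 'Column2': [3, 4], 'Column3': [5, 6]})

# Hypothetical master mapping with Column2 and Column3 swapped on purpose
master_mapping = {'Column1': 'A', 'Column2': 'C', 'Column3': 'B', 'Column4': 'D'}

# Letters derived from the actual column order of the file
actual_mapping = dict(zip(df.columns, ascii_uppercase))

# Columns present in both mappings whose letters disagree: name -> (actual, master)
mismatched = {
    name: (actual_mapping[name], master_mapping[name])
    for name in actual_mapping
    if name in master_mapping and actual_mapping[name] != master_mapping[name]
}

# Columns that appear in only one of the two mappings
missing_from_file = sorted(set(master_mapping) - set(actual_mapping))
missing_from_master = sorted(set(actual_mapping) - set(master_mapping))

print(mismatched)           # {'Column2': ('B', 'C'), 'Column3': ('C', 'B')}
print(missing_from_file)    # ['Column4']
print(missing_from_master)  # []
```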

Related

Dataframe to a dictionary as some columns in a list as keys and one as value

I have a pandas dataframe df that looks like this:
col1 col2 col3
A X 1
B Y 2
C Z 3
I want to convert this into a dictionary with col1 and col2 in a list as key and col3 as value. So, the output would look like this:
{
['A', 'X']: 1,
['B', 'Y']: 2,
['C', 'Z']: 3
}
How do I get my desired output?
Dictionary keys must be hashable. Lists and dictionaries are mutable and therefore unhashable, so they cannot be used as dictionary keys. Tuples, which are similar to lists but immutable, can be used as keys as long as their elements are hashable. Let's recreate your dataframe:
import pandas as pd
data = {'col1':['A','B','C'],'col2':['X','Y','Z'],'col3':[1,2,3]}
df = pd.DataFrame(data)
Now we have the same dataframe. Let's use a dictionary comprehension to iterate through all the rows of the dataframe, taking col1 and col2 as tuple keys and col3 as values:
my_dict = {(df.iloc[i,0],df.iloc[i,1]): df.iloc[i,2] for i in range(len(df))}
Now, you have the following output:
my_dict = {('A', 'X'): 1, ('B', 'Y'): 2, ('C', 'Z'): 3}
Here's a way to do this using pandas methods:
After creating the dataframe:
import pandas as pd
df = pd.DataFrame([['A', 'X', 1], ['B', 'Y', 2], ['C', 'Z', 3]],
columns=['col1', 'col2', 'col3'])
Set the index to the columns that form the dictionary keys, select the column for the values and convert it to a dictionary:
df.set_index(['col1', 'col2']).col3.to_dict()
To get the required result:
{('A', 'X'): 1, ('B', 'Y'): 2, ('C', 'Z'): 3}
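For comparison, the same tuple-keyed dictionary can probably also be built with plain zip, and the failure with list keys is easy to reproduce. The following is a small sketch under the same sample data; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'C'],
                   'col2': ['X', 'Y', 'Z'],
                   'col3': [1, 2, 3]})

# A list cannot be a dictionary key: building this dict raises TypeError
try:
    {['A', 'X']: 1}
    list_key_error = None
except TypeError as exc:
    list_key_error = str(exc)

# The inner zip pairs col1 with col2 into tuples;
# the outer zip pairs each tuple with the matching col3 value
my_dict = dict(zip(zip(df['col1'], df['col2']), df['col3'].tolist()))
print(my_dict)  # {('A', 'X'): 1, ('B', 'Y'): 2, ('C', 'Z'): 3}
```

Calling tolist() on col3 converts the values to plain Python integers, which keeps the printed result free of NumPy scalar types.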

merging two dataset with python

First dataset: dim(d) = (70856886, 12). Second dataset: dim(e) = (354, 6).
Both datasets have a common variable, subject, and I want to merge them by subject. I used this Python code:
# Merge both datasets
data = pd.merge(d, e, on='subject')
After merging, I lose some data: the merged dataset has only 62611728 rows.
Why am I losing those observations? [70856886 - 62611728 = 8245158]
As the documentation states, pd.merge() "Merge[s] DataFrame or named Series objects with a database-style join."
In general, it's a good idea to try something on a small dataset to see if you understand its function correctly and then to apply it to a large dataset.
Here's an example for pd.merge():
import pandas as pd
df1 = pd.DataFrame([
    {'subject': 'a', 'value': 1},
    {'subject': 'a', 'value': 2},
    {'subject': 'b', 'value': 3},
    {'subject': 'c', 'value': 4},
    {'subject': 'c', 'value': 5},
])
df2 = pd.DataFrame([
    {'subject': 'a', 'other': 6},
    {'subject': 'b', 'other': 7},
    {'subject': 'b', 'other': 8},
    {'subject': 'd', 'other': 9}
])
df = pd.merge(df1, df2, on='subject')
print(df)
What output do you expect? It should be this:
  subject  value  other
0       a      1      6
1       a      2      6
2       b      3      7
3       b      3      8
In your case, we can only assume that just 62611728 records could be constructed with a matching 'subject'; the remaining records in either d or e had subjects with no match in the other.
The result contains only rows whose 'subject' value appears in both dataframes. Rows with a non-matching 'subject' are left out, on either side, because the default is an 'inner' join.
Look at the documentation for the other variants (the how parameter): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
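To see which records fall out of an inner join, one option is an outer merge with indicator=True, which adds a _merge column marking each row as 'both', 'left_only' or 'right_only'. The sketch below reuses the small example data; on the real d and e the same idea should apply, though an outer join on 70 million rows may be memory-hungry.

```python
import pandas as pd

df1 = pd.DataFrame({'subject': ['a', 'a', 'b', 'c', 'c'],
                    'value': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'subject': ['a', 'b', 'b', 'd'],
                    'other': [6, 7, 8, 9]})

# Keep every row from both sides and record where each one came from
merged = pd.merge(df1, df2, on='subject', how='outer', indicator=True)

n_both = int((merged['_merge'] == 'both').sum())
n_left_only = int((merged['_merge'] == 'left_only').sum())
n_right_only = int((merged['_merge'] == 'right_only').sum())

# Subjects that an inner join would silently drop
left_only_subjects = merged.loc[merged['_merge'] == 'left_only', 'subject'].unique().tolist()
right_only_subjects = merged.loc[merged['_merge'] == 'right_only', 'subject'].unique().tolist()

print(n_both, n_left_only, n_right_only)        # 4 2 1
print(left_only_subjects, right_only_subjects)  # ['c'] ['d']
```

If every row of the first dataframe must be kept regardless of a match, how='left' is the variant to look at.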

Filtering a dataframe using a dictionary's values

I have a dictionary mapping strings to lists of strings, for example:
{'A':['A'], 'B':['A', 'B', 'C'], 'C':['B', 'E', 'F']}
I am looking to use this to filter a dataframe, creating new dfs with the name of the df being the key and the columns to be copied containing the string listed as the values.
So dataframe A would contain columns from the original dataframe that contain 'A', dataframe B would contain columns that contain 'A', 'B', 'C'. I know that I need to use regex filtering for selecting the columns but am unsure how to do this.
Use DataFrame.filter with regex, joining the values with | (regex "or"). For key C, this selects the columns whose names contain B or E or F:
d = {'A':['A'], 'B':['A', 'B', 'C'], 'C':['B', 'E', 'F']}
dfs = {k:df.filter(regex='|'.join(v)) for k, v in d.items()}
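Here is a small self-contained run of that idea. The column names (A1, B1, ...) are invented for illustration, and re.escape is added as a precaution in case the search strings ever contain regex metacharacters; the original one-liner does not do this.

```python
import re

import pandas as pd

# Hypothetical dataframe whose column names contain the search strings
df = pd.DataFrame({'A1': [1], 'B1': [2], 'C1': [3],
                   'E1': [4], 'F1': [5], 'G1': [6]})
d = {'A': ['A'], 'B': ['A', 'B', 'C'], 'C': ['B', 'E', 'F']}

# One filtered dataframe per key; a column is kept if its name matches any value
dfs = {k: df.filter(regex='|'.join(map(re.escape, v))) for k, v in d.items()}

for key, sub_df in dfs.items():
    print(key, list(sub_df.columns))
# A ['A1']
# B ['A1', 'B1', 'C1']
# C ['B1', 'E1', 'F1']
```

Note that the pattern matches anywhere in the column name, so 'A' would also select a column called 'DATA'.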

Creating variable number of lists from pandas dataframes

I have a pandas dataframe being generated by some other piece of code - the dataframe may have different number of columns each time it is generated: let's call them col1,col2,...,coln where n is not fixed. Please note that col1,col2,... are just placeholders, the actual names of columns can be arbitrary like TimeStamp or PrevState.
From this, I want to convert each column into a list, with the name of the list being the same as the column. So, I want a list named col1 with the entries in the first column of the dataframe and so on till coln.
How do I do this?
Thanks
Creating a separate variable for each column is not recommended. It is better to create a dictionary:
d = df.to_dict('list')
Then select a list by using the column name as the dictionary key:
print(d['col1'])
Sample:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
})
d = df.to_dict('list')
print (d)
{'A': ['a', 'b', 'c', 'd', 'e', 'f'], 'B': [4, 5, 4, 5, 5, 4], 'C': [7, 8, 9, 4, 2, 3]}
print (d['A'])
['a', 'b', 'c', 'd', 'e', 'f']
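If you really do need separate variables, exec can create them, although this only works when every column name is a valid Python identifier:
import pandas as pd
df = pd.DataFrame()
df["col1"] = [1, 2, 3, 4, 5]
df["colTWO"] = [6, 7, 8, 9, 10]
for col_name in df.columns:
    # repr() of a plain list is valid Python source; repr() of a NumPy array
    # is not unless the name `array` has been imported
    exec(col_name + " = " + repr(df[col_name].tolist()))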

Converting dataframe to list and dictionary

I have the following dataframe:
Region Name Price
0 ny A 53.00
1 ln B 52.23
2 ln B 51.20
3 tk C 50.50
I want to convert the data to a list within a list, and the name and price field into a dictionary.
Name field is repeated, but I would like to get the unique values. Then assign the prices to the key.
Something like this: [Region,{Name:Price}]
For example:
[[ny, {"A": array([53.00])}], [ln, {"B": array([52.23, 51.20])}], [tk, {"C": array([50.50])}]]
Can anyone suggest a way to do this?
Thanks.
You could put Region and Name into a MultiIndex and export with to_dict, something like this:
import pandas as pd
df = pd.DataFrame({
    'Region': ['ny', 'ln', 'ln', 'tk'],
    'Name': ['A', 'B', 'B', 'C'],
    'Price': [53, 52, 51, 50]
})
# First, combine the values for each Region/Name pair into a list
df_grouped = df.groupby(['Region', 'Name']).Price.apply(list).to_frame()
# Second, build a nested dictionary keyed by Region, then by Name
df_grouped.groupby(level=0).apply(
    lambda df: df.xs(df.name).to_dict()['Price']
).to_dict()
>>> {'ln': {'B': [52, 51]}, 'ny': {'A': [53]}, 'tk': {'C': [50]}}
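The result above is a nested dictionary rather than the [Region, {Name: Price}] list shape asked for. A plain loop over two groupby levels might get closer to that shape; the following is only a sketch, using the prices from the question and illustrative variable names.

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['ny', 'ln', 'ln', 'tk'],
    'Name': ['A', 'B', 'B', 'C'],
    'Price': [53.00, 52.23, 51.20, 50.50],
})

result = []
# sort=False keeps regions in their order of first appearance
for region, region_df in df.groupby('Region', sort=False):
    # For each name within the region, collect its prices as a NumPy array
    prices_by_name = {
        name: prices.to_numpy()
        for name, prices in region_df.groupby('Name', sort=False)['Price']
    }
    result.append([region, prices_by_name])

print(result)
```

Each entry of result is a two-element list: the region string followed by a dictionary mapping every unique name to an array of its prices.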
