import pandas as pd
pincodes = [800678,800456]
numbers = [2567890, 256757]
labels = ['R','M']
first = pd.DataFrame({'Number':numbers, 'Pincode':pincodes},
index=labels)
print(first)
The above code gives me the following (correct) dataframe.
Number Pincode
R 2567890 800678
M 256757 800456
But, when I use this statement,
second = pd.DataFrame([numbers,pincodes],
index=labels, columns=['Number','Pincode'])
print(second)
then I get the following (incorrect) output.
Number Pincode
R 2567890 256757
M 800678 800456
As you can see, the two Data Frames are different. Why does this happen? What's so different in this dictionary vs list approach?
The constructor of pd.DataFrame() includes this documentation.
Init signature: pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Docstring:
...
Parameters
----------
data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
Dict can contain Series, arrays, constants, or list-like objects
.. versionchanged :: 0.23.0
If data is a dict, column order follows insertion-order for
Python 3.6 and later.
.. versionchanged :: 0.25.0
If data is a list of dicts, column order follows insertion-order
for Python 3.6 and later.
The key word is column. In the first approach, you correctly tell pandas that numbers is the column with the label 'Number'. In the second approach, you tell pandas that the columns are 'Number' and 'Pincode' and that the data comes from the list of lists [numbers, pincodes]. Pandas treats each inner list as a row, so the first element of every inner list is assigned to the 'Number' column and the second element to the 'Pincode' column.
If you want to enter your data this way (not as a dictionary), you need to transpose the list of lists.
>>> import numpy as np
# old way
>>> pd.DataFrame([numbers,pincodes],
index=labels,columns=['Number','Pincode'])
Number Pincode
R 2567890 256757
M 800678 800456
# Transpose the data instead so the rows become the columns.
>>> pd.DataFrame(np.transpose([numbers,pincodes]),
index=labels,columns=['Number','Pincode'])
Number Pincode
R 2567890 800678
M 256757 800456
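If you would rather not import numpy just for this, the same transposition can probably be done with the built-in zip. This is a minimal sketch that reuses the lists from the question; the names rows and third are only illustrative.

```python
import pandas as pd

numbers = [2567890, 256757]
pincodes = [800678, 800456]
labels = ['R', 'M']

# zip pairs the i-th number with the i-th pincode, so each tuple is one row.
rows = list(zip(numbers, pincodes))

third = pd.DataFrame(rows, index=labels, columns=['Number', 'Pincode'])
print(third)
```

Each tuple in rows now holds the values for one label, which is the layout the list-based constructor expects.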
I am extracting an HTML table from the web with pandas.
From the result (a list of DataFrame objects) I want to return all DataFrames where the cell value is an element of a given array.
So far I am struggling to access only one column value and not the whole object.
Structure of the table (the header lines are not extracted correctly, so this is the real output):
0           1              2       3
Date        Name           Number  Text
09.09.2022  Smith Jason    3290    Free Car Wash
12.03.2022  Betty Paulsen  231     10l Gasoline
import pandas as pd
import numpy as np
url = f'https://some_website.com'
df = pd.read_html(url)
arr_Nr = ['3290', '9273']
def correct_number():
    for el in df[0][1]:
        if (el in arr_Nr):
            return True
def get_winner():
    for el in df:
        if (el in arr_Nr):
            return el
print(get_winner())
With the function correct_number() I can tell that there is a winner, but I cannot get the details when I call get_winner().
EDIT
I think I am now one step closer: the function read_html() returns a list of DataFrame objects. In my example there is only one table, so accessing it via df = dfs[0] should give me the correct DataFrame object.
But when I try the following, the code doesn't work as expected: no filter is applied and the table is returned in full:
df2 = df[df.Number == '3290']
print(df2)
Okay, I finally figured it out:
Pandas returned a list of DataFrame objects. In my example there is only one table, so I first had to select that table (the DataFrame object) from the list.
Before I could compare the values, I had to convert them to integers. Pandas seemed to extract them as strings, so they could not be compared with my array properly.
In the end the code looks far more elegant than I had expected:
import pandas as pd
import numpy as np
url = f'https://mywebsite.com/winners-2022'
dfs_list = pd.read_html(url, header=0, flavor='bs4')
df = dfs_list[0]
winner_nrs = [3290, 843]
result = df[df.Losnummer.astype(int).isin(winner_nrs)]
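To see why the conversion matters without hitting a website, here is a small sketch. The DataFrame raw is a hypothetical stand-in for dfs_list[0], built from the sample table in the question. It assumes the header row was read as data and every cell arrived as a string.

```python
import pandas as pd

# Hypothetical stand-in for dfs_list[0]: the header row was read as data,
# and every cell is a string, as described in the question.
raw = pd.DataFrame([
    ['Date', 'Name', 'Number', 'Text'],
    ['09.09.2022', 'Smith Jason', '3290', 'Free Car Wash'],
    ['12.03.2022', 'Betty Paulsen', '231', '10l Gasoline'],
])

# Promote the first row to column labels and drop it from the data.
df = raw.iloc[1:].reset_index(drop=True)
df.columns = raw.iloc[0].tolist()

winner_nrs = [3290, 843]

# Comparing strings with integers matches nothing...
no_match = df[df['Number'].isin(winner_nrs)]

# ...so convert the column to integers before testing membership.
result = df[df['Number'].astype(int).isin(winner_nrs)]
print(result)
```

When reading a real page, passing header=0 to read_html (as in the code above) should make the manual header promotion unnecessary.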
I'm new to pandas and I want to know if there is a way to map a column of lists in a dataframe to values stored in a dictionary.
Let's say I have the dataframe 'df' and the dictionary 'dic'. I want to create a new column named 'Description' in the dataframe that shows the description of each code. The descriptions for each row should be stored in a list as well.
import pandas as pd
data = {'Codes':[['E0'],['E0','E1'],['E3']]}
df = pd.DataFrame(data)
dic = {'E0':'Error Code', 'E1':'Door Open', 'E2':'Door Closed'}
The most efficient approach is probably a list comprehension.
df['Description'] = [[dic.get(x, None) for x in l] for l in df['Codes']]
output:
Codes Description
0 [E0] [Error Code]
1 [E0, E1] [Error Code, Door Open]
2 [E3] [None]
If needed, you can use an alternative list comprehension that skips non-matches: [[dic[x] for x in l if x in dic] for l in df['Codes']], and then post-process to replace the resulting empty lists with NaN. Skipping would probably be ambiguous, though, if you have one no-match among several matches (which one is which?).
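The skip-and-replace variant could look roughly like this. It is a sketch that reuses df and dic from the question; the intermediate name matched is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Codes': [['E0'], ['E0', 'E1'], ['E3']]})
dic = {'E0': 'Error Code', 'E1': 'Door Open', 'E2': 'Door Closed'}

# Keep only the codes that have an entry in the dictionary.
matched = [[dic[x] for x in codes if x in dic] for codes in df['Codes']]

# Replace rows left with no description by NaN.
df['Description'] = [desc if desc else np.nan for desc in matched]
print(df)
```

Here the row with only 'E3' ends up as NaN, because that code has no entry in dic.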
I am working on a project that uses a pandas data frame. I received some values in the columns as shown below.
I need to add the pos_vec column and the word_vec column element-wise and store the result in a new column called sum_of_arrays. Each array in the new column should have size 2.
Eg:
pos_vec                    word_vec                sum_of_arrays
[-0.22683072, 0.32770252]  [0.3655883, 0.2535131]  [0.13875758, 0.58121562]
Is there anyone who can help me? I'm stuck in here. :(
If you convert them to np.array you can simply sum them.
import pandas as pd
import numpy as np
df = pd.DataFrame({'pos_vec':[[-0.22683072,0.32770252],[0.14382899,0.049593687],[-0.24300802,-0.0908088],[-0.2507714,-0.18816864],[0.32294357,0.4486494]],
'word_vec':[[0.3655883,0.2535131],[0.33788466,0.038143277], [-0.047320127,0.28842866],[0.14382899,0.049593687],[-0.24300802,-0.0908088]]})
If you want to use numpy (note that applymap is deprecated since pandas 2.1, where DataFrame.map is its replacement)
df['col_sum'] = df[['pos_vec','word_vec']].applymap(lambda x: np.array(x)).sum(1)
If you don't want to use numpy
df['col_sum'] = df.apply(lambda x: [sum(x) for x in zip(x.pos_vec,x.word_vec)], axis=1)
There may be cleaner approaches that use pandas to iterate over the columns, but this is the solution I came up with, extracting the data from the DataFrame as lists:
# Extract data as lists
pos_vec = df["pos_vec"].tolist()
word_vec = df["word_vec"].tolist()
# Create new list with desired calculation
sum_of_arrays = [[x+y for x,y in zip(l1, l2)] for l1,l2 in zip(pos_vec, word_vec)]
# Add new list to DataFrame
df["sum_of_arrays"] = sum_of_arrays
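Another option, assuming every list in both columns has the same length, is to stack each column into a 2-D numpy array and add the arrays in one step. This is a sketch on the first two rows of the sample data; pos, word and summed are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'pos_vec': [[-0.22683072, 0.32770252], [0.14382899, 0.049593687]],
    'word_vec': [[0.3655883, 0.2535131], [0.33788466, 0.038143277]],
})

# Stack each column into an (n_rows, 2) array and add them in one operation.
pos = np.array(df['pos_vec'].tolist())
word = np.array(df['word_vec'].tolist())
summed = pos + word

# Store the result back as one two-element list per row.
df['sum_of_arrays'] = summed.tolist()
print(df)
```

If the lists had different lengths, np.array would not produce a regular 2-D array, and the zip-based approach above would be the safer choice.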
I want to split data in two columns from a data frame and construct new columns using this data.
My data frame is,
dfc = pd.DataFrame( {"A": ["GT:DP:RO:QR:AO:QA:GL", "GT:DP:RO:QR:AO:QA:GL", "GT:DP:RO:QR:AO:QA:GL", "GT:DP:GL", "GT:DP:GL"], "B": ["0/1:71:43:1363:28:806:-71.1191,0,-121.278", "0/1:71:43:1363:28:806:-71.1191,0,-121.278", "0/1:71:43:1363:28:806:-71.1191,0,-121.278", "1/1:49:-103.754,0,-3.51307", "1/1:49:-103.754,0,-3.51307"]} )
I want individual columns named GT, DP, RO, QR, AO, QA, GL with values from column B
I want to produce output as,
We can split the two columns using a = dfc.A.str.split(":", expand = True) and b = dfc.B.str.split(":", expand = True) to get two individual data frames. These can be merged with c = pd.merge(a, b, left_index = True, right_index = True) to get all the desired data, but not in the expected format.
Any suggestions? I think a better way might be to split both columns A and B and then create a dict column with values from A as keys and values from B as values. This column could then be converted to a data frame.
Thanks
For each row, split both columns on the separator ":" and zip the pieces into a mapping from the keys in A to the values in B. Use an OrderedDict to preserve the key order, and collect the mappings in a list.
Then feed this list to the DataFrame constructor.
from collections import OrderedDict
L = dfc.apply(
    lambda x: OrderedDict(zip(x['A'].split(':'), x['B'].split(':'))), 1).tolist()
pd.DataFrame(L)
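On Python 3.7 and later, plain dicts keep insertion order, so the same idea can likely be written without OrderedDict. This is a sketch on a shortened copy of dfc (one row of each shape); records and result are illustrative names.

```python
import pandas as pd

dfc = pd.DataFrame({
    "A": ["GT:DP:RO:QR:AO:QA:GL", "GT:DP:GL"],
    "B": ["0/1:71:43:1363:28:806:-71.1191,0,-121.278",
          "1/1:49:-103.754,0,-3.51307"],
})

# Pair each key from A with the value at the same position in B, row by row.
records = [dict(zip(a.split(":"), b.split(":")))
           for a, b in zip(dfc["A"], dfc["B"])]

# Keys missing from a row (RO, QR, AO, QA in the second row) become NaN.
result = pd.DataFrame(records)
print(result)
```

All values stay strings here; converting columns such as DP to numbers would be a separate step.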
I'm going to split everything by ':'. But I have 2 columns. If I stack first, I get a Series on which I can more easily use str.split.
I now have a split Series that I can group by level=0, which is the original index.
Within each group I zip and dict to get Series-like structures with the pieces of the original column A as the indices and the pieces of B as the values.
unstack and I'm done.
gb = dfc.stack().str.split(':').groupby(level=0)
gb.apply(lambda x: dict(zip(*x))).unstack()
I have a data set with several dozen columns. I am sorting the two columns in question by max value and storing the result in a variable to print it later to a report. How do I return only the two columns so that they are on the same line as my string "Max"? Below is the method I am using, which also returns the ID # in my variable.
# Create DF
prim1 = mru[['Time', 'Motion:MRU']]
# Sort (DataFrame.sort was removed from pandas; sort_values replaces it)
prim1 = prim1.sort_values(['Motion:MRU'], ascending=True)
primmin = prim1['Motion:MRU'].min()
print('Max: ', prim1[:1])
Basically, what you see printed is a pandas object, which is displayed together with its index in the form:
<index> <value>
If you want just the values, access the underlying numpy array through the values attribute:
print('Max: ', prim1[:1].values[0])
The values attribute returns a numpy array, and subscripting it with [0] selects its first element. Because prim1[:1] is a one-row slice here, that element holds the row's values without the index.
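If the goal is just to print the time and the value of the maximum on one line, a sketch along these lines might be simpler than sorting. The mru DataFrame below is hypothetical sample data; only the column names 'Time' and 'Motion:MRU' come from the question.

```python
import pandas as pd

# Hypothetical data standing in for the 'mru' DataFrame from the question.
mru = pd.DataFrame({
    'Time': ['10:00', '10:01', '10:02'],
    'Motion:MRU': [0.4, 1.7, 0.9],
    'Other': [1, 2, 3],
})

# idxmax returns the index label of the row holding the largest value.
max_label = mru['Motion:MRU'].idxmax()

# Pull the two scalars from that row; no index is attached to them.
max_time = mru.loc[max_label, 'Time']
max_motion = mru.loc[max_label, 'Motion:MRU']

print('Max:', max_time, max_motion)
```

Because max_time and max_motion are plain scalars, the printed line contains no index column.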