I have a list containing 119,554 other lists. All of the inner lists have the same length of 334. I needed to convert the list into a dataframe using df2 = pd.DataFrame(df). The result shows (119554, 706).
I don't know why additional columns were added. It should be (119554, 334) if I'm not wrong.
Any suggestions? Thanks!
Let's say you have the following lists within another list:
lst = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9],
       [10, 11, 12]]
You can see that the list contains 4 lists, each of length 3. Plain Python lists do not have a shape attribute.
You can convert this list into a pandas dataframe in the following way:
df = pd.DataFrame(lst)
The dataframe looks like this:
| | 0 | 1 | 2 |
|---:|----:|----:|----:|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
| 2 | 7 | 8 | 9 |
| 3 | 10 | 11 | 12 |
The shape of the dataframe is:
print(df.shape)
>> (4, 3)
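If the reported width is larger than expected (as in the question's (119554, 706)), the usual cause is that the inner lists do not all have the same length: pandas pads the shorter rows with NaN and makes the frame as wide as the longest inner list. A minimal sketch of both the symptom and a quick length check:

```python
import pandas as pd

# Inner lists of unequal length: pandas pads the short rows with NaN,
# so the frame is as wide as the longest inner list.
ragged = [[1, 2, 3],
          [4, 5]]
df = pd.DataFrame(ragged)
print(df.shape)  # (2, 3) -- wider than the shortest row

# A quick way to check whether all inner lists share one length:
lengths = {len(row) for row in ragged}
print(lengths)   # more than one entry means the lists are ragged
```

If the set of lengths has more than one element, that explains the extra, mostly-NaN columns.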
I have a simple data frame with a few columns
df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
A B
0 1 2
1 1 3
2 4 6
What I am trying to achieve is writing to Excel with a custom header, which can be multi-line.
So output of excel would be
| App Input |      |
| --------- | ---- |
| A         | B    |
| data      | data |
| 1         | 2    |
| 1         | 3    |
| 4         | 6    |
Any ideas how I can achieve this? I was thinking of a MultiIndex, but I don't think it will work since it's not a true multi-index.
Since headers in Excel are just cells with string values, you can precede the "real" values in the columns with textual rows that, together with the dataframe's column names, form the multi-line header you want.
For example you could use the following values to get the desired result you provided:
df = pd.DataFrame([['', ''], ['A', 'B'], ['data', 'data'], [1, 2], [1, 3], [4, 6]], columns=['App Input', ''])
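A sketch of writing that frame out: with index=False, the column names 'App Input' and '' become the sheet's first row, followed by the padding rows and the data. The file name is just an example, and the to_excel call (left commented here) needs an Excel engine such as openpyxl installed:

```python
import pandas as pd

# Prepend textual rows so they form the multi-line header together
# with the column names.
df = pd.DataFrame([['', ''], ['A', 'B'], ['data', 'data'],
                   [1, 2], [1, 3], [4, 6]],
                  columns=['App Input', ''])

# index=False keeps the row numbers out of the sheet
# (example file name; requires an engine such as openpyxl):
# df.to_excel('output.xlsx', index=False)

print(df.to_string(index=False))
```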
I am trying to drop values in one data frame based on values in another data frame. I would appreciate your expertise on this, please.
Data frame 1 – df1:
| A | C |
| -------- | -------------- |
| f | 10 |
| c | 15 |
| b | 20 |
| d | 30 |
| h | 35 |
| e | 40 |
Data frame 2 – df2:
| A | B |
| -------- | -------------- |
| a | w |
| b | 1 |
| c | w |
| d | 1 |
| e | w |
| f | 0 |
| g | 1 |
| h | 1 |
I want to modify df1 and drop (eliminate) the rows whose value in column A has a corresponding value of 'w' in column B of df2.
The resulting data frame looks like this:
| A | C |
| -------- | -------------- |
| f | 10 |
| b | 20 |
| d | 30 |
| h | 35 |
You can first create a list from df2 of the A values whose associated value in B is 'w', and then use isin together with ~ (which essentially means "not in"):
a = df2.loc[df2['B'].str.contains(r'w', case=False, na=False), 'A'].tolist()
b = df1[~df1['A'].isin(a)]
And get back your desired outcome:
print(b)
A C
0 f 10
2 b 20
3 d 30
4 h 35
I find this link particularly helpful if you want to read more on Python's operators:
https://www.w3schools.com/python/python_operators.asp
What you're looking for is a merge with certain conditions:
# recreating your data
>>> import pandas as pd
>>> df1 = pd.DataFrame.from_dict({'A': list('fcbdhe'), 'B': [10, 15, 20, 30, 35, 40]})
>>> df2 = pd.DataFrame.from_dict({'A': list('abcdefgh'), 'B': list('w1w1w011')})
# merge but we further need to project that to fit the desired output
>>> df1.merge(df2[df2['B'] != 'w'], how='inner', on='A')
A B_x B_y
0 f 10 0
1 b 20 1
2 d 30 1
3 h 35 1
# what you're looking for
>>> df1.merge(df2[df2['B'] != 'w'], how='inner', on='A')[['A', 'B_x']].rename(columns={'B_x': 'C'})
A C
0 f 10
1 b 20
2 d 30
3 h 35
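An equivalent pattern is a left anti-join using merge's indicator parameter. This is a sketch rather than part of the original answer, recreating the question's frames (here df1 keeps the question's column name C):

```python
import pandas as pd

df1 = pd.DataFrame({'A': list('fcbdhe'), 'C': [10, 15, 20, 30, 35, 40]})
df2 = pd.DataFrame({'A': list('abcdefgh'), 'B': list('w1w1w011')})

# Left anti-join: keep the rows of df1 whose A does NOT appear among
# the 'w' rows of df2. indicator=True adds a _merge column saying
# which side each row came from.
w_rows = df2.loc[df2['B'] == 'w', ['A']]
merged = df1.merge(w_rows, on='A', how='left', indicator=True)
result = merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(result)
```

Rows whose A matched a 'w' row come back as 'both' and are filtered out, so only f, b, d, and h survive.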
I assume that your first dataframe is df1 and your second dataframe is df2.
First, list all the values of column A of df2 whose value in column B is 'w':
df2_A = df2[df2['B'] == 'w']['A'].values.tolist()
# >>> output: ['a', 'c', 'e']
This lists all the values of column A of df2 that have the value 'w' in column B.
Then you can drop the rows of df1 whose values in A lie in that list:
for i, val in enumerate(df1['A'].values.tolist()):
    if val in df2_A:
        df1.drop(i, axis=0, inplace=True)
Full code:
df2_A = df2[df2['B'] == 'w']['A'].values.tolist()
for i, val in enumerate(df1['A'].values.tolist()):
    if val in df2_A:  # checking whether df1's A column has values in the list above
        df1.drop(i, axis=0, inplace=True)
df1
A one-liner: merge df1 with the non-'w' rows of df2 on their shared column A, then drop the helper column B:
df1.merge(df2[df2['B'] != 'w'], how='inner').drop(["B"], axis=1)
I have a DataFrame with col names 'a', 'b', 'c'
#Input
import pandas as pd
list_of_dicts = [
{'a' : 0, 'b' : 4, 'c' : 3},
{'a' : 1, 'b' : 1, 'c' : 2 },
{'a' : 0, 'b' : 0, 'c' : 0 },
{'a' : 1, 'b' : 0, 'c' : 3 },
{'a' : 2, 'b' : 1, 'c' : 0 }
]
df = pd.DataFrame(list_of_dicts)
#Input DataFrame
|   | a | b | c |
|---|---|---|---|
| 0 | 0 | 4 | 3 |
| 1 | 1 | 1 | 2 |
| 2 | 0 | 0 | 0 |
| 3 | 1 | 0 | 3 |
| 4 | 2 | 1 | 0 |
I want to reduce the wide DataFrame to one column, in which each column name appears as a value repeated according to the corresponding cell value. The operation must be done row-wise.
#Output
|    | Values |
|----|--------|
| 0  | b |
| 1  | b |
| 2  | b |
| 3  | b |
| 4  | c |
| 5  | c |
| 6  | c |
| 7  | a |
| 8  | b |
| 9  | c |
| 10 | c |
| 11 | a |
| 12 | c |
| 13 | c |
| 14 | c |
| 15 | a |
| 16 | a |
| 17 | b |
Explanation:
Row 0 in the Input DataFrame has 4 'b' and 3 'c', so the first seven elements of the output DataFrame are bbbbccc
Row 1 similarly has 1 'a' 1 'b' and 2 'c', so the output will have abcc as the next 4 elements
Row 2 has 0s across, so it would be skipped entirely.
The Order of the output is very important
For example, the first row has '4' b and 3 'c', so the output DataFrame must be bbbbccc because Column 'b' comes before column 'c'. The operation must be row-wise from left to right.
I'm trying to find an efficient way to accomplish this. The real dataset is too big for me to compute naively. Please provide a Python 3 solution.
Stack the data (you could melt as well), and drop rows where the count is zero. Finally use numpy.repeat to build a new array, and build your new dataframe from that.
import numpy as np

reshape = df.stack().droplevel(0).loc[lambda x: x != 0]
pd.DataFrame(np.repeat(reshape.index, reshape), columns=['values'])
values
0 b
1 b
2 b
3 b
4 c
5 c
6 c
7 a
8 b
9 c
10 c
11 a
12 c
13 c
14 c
15 a
16 a
17 b
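The melt route mentioned at the start needs one extra step to preserve the required left-to-right, top-to-bottom order: melt stacks whole columns, so a stable sort on the original row index restores row order. A sketch (not part of the original answer) using the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([{'a': 0, 'b': 4, 'c': 3},
                   {'a': 1, 'b': 1, 'c': 2},
                   {'a': 0, 'b': 0, 'c': 0},
                   {'a': 1, 'b': 0, 'c': 3},
                   {'a': 2, 'b': 1, 'c': 0}])

# melt column by column but keep the original row index; a stable sort
# on that index then restores row-wise (left-to-right) order.
m = df.melt(ignore_index=False).sort_index(kind='stable')
m = m[m['value'] != 0]

# Repeat each column name by its count, exactly as in the stack version.
out = pd.DataFrame(np.repeat(m['variable'].to_numpy(), m['value']),
                   columns=['values'])
print(out)
```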
I don't think pandas buys you anything in this process, and especially if you have a large amount of data you don't want to read that all into memory and reprocess it into another large data structure.
import csv

with open('input.csv', 'r') as fh:
    reader = csv.DictReader(fh)
    for row in reader:
        # DictReader exposes the header row as `fieldnames` (not `headers`)
        for key in reader.fieldnames:
            value = int(row[key])
            for i in range(value):
                print(key)
I've got an Excel file / pandas dataframe that looks like this:
+------+--------+
| ID | 2nd ID |
+------+--------+
| ID_1 | R_1 |
| ID_1 | R_2 |
| ID_2 | R_3 |
| ID_3 | |
| ID_4 | R_4 |
| ID_5 | |
+------+--------+
How can I transform it to python dictionary? I want my result to be like:
{'ID_1':['R_1','R_2'],'ID_2':['R_3'],'ID_3':[],'ID_4':['R_4'],'ID_5':[]}
What should I do, to obtain it?
If you need to remove missing values (for IDs with no existing second value), use Series.dropna in a lambda function inside GroupBy.apply:
d = df.groupby('ID')['2nd ID'].apply(lambda x: x.dropna().tolist()).to_dict()
print (d)
{'ID_1': ['R_1', 'R_2'], 'ID_2': ['R_3'], 'ID_3': [], 'ID_4': ['R_4'], 'ID_5': []}
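For reference, a self-contained sketch of the line above, recreating the question's frame with NaN for the empty cells:

```python
import numpy as np
import pandas as pd

# Recreate the question's data; NaN stands for the empty '2nd ID' cells.
df = pd.DataFrame({'ID': ['ID_1', 'ID_1', 'ID_2', 'ID_3', 'ID_4', 'ID_5'],
                   '2nd ID': ['R_1', 'R_2', 'R_3', np.nan, 'R_4', np.nan]})

# Group by ID, drop the NaNs inside each group, and collect the rest
# into a list; IDs whose only value was NaN end up with an empty list.
d = df.groupby('ID')['2nd ID'].apply(lambda x: x.dropna().tolist()).to_dict()
print(d)
```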
Or use the fact that np.nan == np.nan returns False in a list comprehension to filter out missing values; see also the warning about this in the docs for more explanation.
d = df.groupby('ID')['2nd ID'].apply(lambda x: [y for y in x if y == y]).to_dict()
If need remove empty strings:
d = df.groupby('ID')['2nd ID'].apply(lambda x: [y for y in x if y != '']).to_dict()
Apply a function over the rows of the dataframe which appends each value to your dict. apply is not in-place, so the dictionary is filled as a side effect.

d = {key: [] for key in df.ID.unique()}  # avoid dict.fromkeys(..., []): every key would share the same list

def func(x):
    d[x.ID].append(x["2nd ID"])

# will return a series of Nones
df.apply(func, axis=1)
Edit:
I asked on Gitter and @gurukiran07 gave me an answer: what you are trying to do is the reverse of the explode function.
s = pd.Series([[1, 2, 3], [4, 5]])
0 [1, 2, 3]
1 [4, 5]
dtype: object
exploded = s.explode()
0 1
0 2
0 3
1 4
1 5
dtype: object
exploded.groupby(level=0).agg(list)
0 [1, 2, 3]
1 [4, 5]
dtype: object
The question asked of me is the following:
Print the 2-dimensional list mult_table by row and column. Hint: use nested loops. Sample output for the given program:
1 | 2 | 3
2 | 4 | 6
3 | 6 | 9
So far I have this:
mult_table = [
[1, 2, 3],
[2, 4, 6],
[3, 6, 9]
]
for row in mult_table:
    for cell in row:
        print(cell, end=' | ')
    print()
The output this gives me is:
1 | 2 | 3 |
2 | 4 | 6 |
3 | 6 | 9 |
I need to know how I can remove the trailing | that is being printed after the last column.
Thank you for your help in advance.
You can use the str.join method instead of always printing a pipe as an ending character:
for row in mult_table:
    print(' | '.join(map(str, row)))
Or you can use the sep parameter:
for row in mult_table:
    print(*row, sep=' | ')