The Fastest way to create a dataframe sample from StackOverflow content [duplicate] - python

This question already has answers here:
How to copy/paste DataFrame from Stack Overflow into Python
(3 answers)
Closed 4 years ago.
For pandas-related questions on Stack Overflow, people usually provide their sample data like this:
a b c d e
0 -0.420430 -0.394562 0.760232 0.152246 -0.671229
1 0.388447 0.676054 -0.058273 -0.246588 0.811332
2 -0.498263 -0.108011 0.952489 0.504729 -0.385724
3 1.069371 0.143752 0.414916 -1.180362 -0.029045
4 -0.245684 -0.150180 0.210579 0.063154 0.261488
5 0.064939 -0.396667 0.857411 -0.460206 0.039658
What's the most efficient way to recreate the data in my own Jupyter notebook, so I can investigate the question further?
Usually, I copy the data above into Notepad, replace the spaces with commas, and run the following code to create the sample data:
data = np.array([-0.420430,-0.394562,0.760232,0.152246,...]) # paste the result from notepad here
df = pd.DataFrame(data.reshape(-1,5),columns=[HEADERS_OF_DATA]) # 5 is number of columns
However, this is quite slow and inconvenient. Is there any faster way to do so?

You can do this with pd.read_clipboard().
Copy the posted DataFrame from the question, and this line of code will parse the clipboard text into a DataFrame (it passes the text on to pd.read_csv(), splitting on whitespace by default):
df = pd.read_clipboard()
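When no clipboard is available (for example in a remote or headless session), a similar result can be obtained by pasting the text into a string and parsing it directly. The following is a minimal sketch, assuming the columns are separated by runs of whitespace and that the header row has one name fewer than the data rows, so pandas treats the first column as the index; the variable name `text` is only illustrative.

```python
from io import StringIO

import pandas as pd

# Sample rows pasted from the question; the header has no label for the index column
text = """a b c d e
0 -0.420430 -0.394562 0.760232 0.152246 -0.671229
1 0.388447 0.676054 -0.058273 -0.246588 0.811332
"""

# Split on any run of whitespace; the unlabeled first column becomes the index
df = pd.read_csv(StringIO(text), sep=r"\s+")
print(df)
```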

Related

How can I separate text into multiple values in a CSV file using Python? [duplicate]

This question already has answers here:
Split one column into multiple columns by multiple delimiters in Pandas
(2 answers)
Closed 1 year ago.
I'd like to begin processing some data for analysis, but I have to separate the responses into multiple values. Currently each column contains one value that combines 3 responses: Agree: #score, Disagree: #score, Neither agree nor disagree. I'd like to separate the responses from the column into individual values to create an analysis for a visualization. Would I need to use regular expressions to do this?
So far, the code I have just loads the data, along with some libraries I plan to use:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
def load_data():
    # importing datasets
    df = pd.read_csv('dataset.csv')
    return df

load_data().head()
You need to use str.split(';') first to split the values into multiple columns. Then, for each column value, split the string again using str.split(':') and keep only the last part ([-1]).
Here's how you can do it.
import pandas as pd
df = pd.DataFrame({'username':['Dragonfly','SpeedHawk','EagleEye'],
'Question1':['Comfortable:64;Neither comfortable nor uncomfortable:36',
'Comfortable:0;Neither comfortable nor uncomfortable:100',
'Comfortable:10;Neither comfortable nor uncomfortable:90'],
'Question2':['Agree:46;Disagree:13;Neither agree nor disagree:41',
'Agree:96;Disagree:0;Neither agree nor disagree:4',
'Agree:90;Disagree:5;Neither agree nor disagree:5']})
df[['Q1_Comfortable','Q1_Neutral']] = df['Question1'].str.split(';',expand=True)
df[['Q2_Agree','Q2_Disagree','Q2_Neutral']] = df['Question2'].str.split(';',expand=True)
df.drop(columns=['Question1','Question2'],inplace=True)
for col in df.columns[1:]:
    df[col] = df[col].str.split(':').str[-1]
print(df)
The output of this will be:
username Q1_Comfortable Q1_Neutral Q2_Agree Q2_Disagree Q2_Neutral
0 Dragonfly 64 36 46 13 41
1 SpeedHawk 0 100 96 0 4
2 EagleEye 10 90 90 5 5
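Note that the values produced by the split are still strings. The sketch below shows one possible variation that avoids hard-coding the new column names and converts the scores to integers. The helper name `expand_responses` and the prefix argument are hypothetical, and the code assumes every cell follows the `Label:score;Label:score` pattern with an integer score.

```python
import pandas as pd

def expand_responses(series, prefix):
    """Turn 'Label:score;Label:score' cells into one integer column per label."""
    def parse(cell):
        # Split on ';' for each response, then on the last ':' for label and score
        pairs = (part.rsplit(':', 1) for part in cell.split(';'))
        return {f"{prefix}_{label}": int(score) for label, score in pairs}
    return pd.DataFrame([parse(cell) for cell in series], index=series.index)

df = pd.DataFrame({
    'username': ['Dragonfly', 'SpeedHawk'],
    'Question2': ['Agree:46;Disagree:13;Neither agree nor disagree:41',
                  'Agree:96;Disagree:0;Neither agree nor disagree:4'],
})

# Keep the username column and attach the expanded integer columns
wide = df[['username']].join(expand_responses(df['Question2'], 'Q2'))
print(wide)
```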

How to get the output to show up for all fields in a Jupyter notebook? [duplicate]

This question already has answers here:
How do I expand the output display to see more columns of a Pandas DataFrame?
(22 answers)
Closed 3 years ago.
I am sure this has been asked a million times, but I must be googling the wrong thing. I am playing with a Kaggle dataset that is multidimensional (81 fields). The function is simple:
def calc_missing_data(df):
    total = df.isnull().sum().sort_values(ascending=False)
    percent_1 = df.isnull().sum() / df.isnull().count() * 100
    percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
    missing_data = pd.concat([total, percent_2], axis=1, keys=['Total', '%'])
    return missing_data

calc_missing_data(df)
But the output is limited to only part of the fields:
Is there a way to see all the outputs? Thank you.
You can show all columns by setting the display.max_columns option to None:
import pandas as pd
pd.set_option('display.max_columns', None)
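In the question, the summary frame has one row per field, so the row limit (display.max_rows) may be the setting that actually truncates the output. A minimal sketch, assuming a made-up frame with 81 fields, that lifts both limits temporarily with pd.option_context instead of changing them globally:

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle data: 81 fields, each with one missing value
df = pd.DataFrame({f"field_{i}": [1.0, None] for i in range(81)})
missing = df.isnull().sum().to_frame("Total")  # 81 rows, one per field

# The options apply only inside the with-block and are restored afterwards
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    full_text = repr(missing)

print(full_text)
```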

How to load a dataframe from a printed dataframe string? [duplicate]

This question already has answers here:
Create Pandas DataFrame from a string
(7 answers)
How to make good reproducible pandas examples
(5 answers)
Closed 3 years ago.
Often people ask questions on Stack Overflow with the output of print(dataframe). It would be convenient to have a way of quickly loading that data into a pandas.DataFrame object.
What are the recommended ways of loading a dataframe from a dataframe string (which may or may not be properly formatted)?
Example-1
If you want to load the following string as a dataframe what would you do?
# Dummy Data
s1 = """
Client NumberOfProducts ID
A 1 2
A 5 1
B 1 2
B 6 1
C 9 1
"""
Example-2
This type is more similar to what you find in csv file.
# Dummy Data
s2 = """
Client, NumberOfProducts, ID
A, 1, 2
A, 5, 1
B, 1, 2
B, 6, 1
C, 9, 1
"""
Expected Output
References
Note: The following two links do not address the specific situation presented in Example-1. The reason I think my question is not a duplicate is that I think one cannot load the string in Example-1 using any of the solutions already posted on those links (at the time of writing).
Create Pandas DataFrame from a string. Note that pd.read_csv(StringIO(s1), sep), as suggested here, doesn't really work for Example-1. You get the following output.
This question was marked as a duplicate of two Stack Overflow links. One of them is the one above, which fails to address the case presented in Example-1. The second one is How to make good reproducible pandas examples. Among all the answers presented there, only one looked like it might work for Example-1, but it did not work.
# could not read the clipboard and threw error
pd.read_clipboard(sep='\s\s+')
Error Thrown:
PyperclipException:
Pyperclip could not find a copy/paste mechanism for your system.
For more information, please visit https://pyperclip.readthedocs.org
I can suggest two methods to approach this problem.
Method-1
Process the string with regex and numpy to make the dataframe. In my experience this works most of the time. It would work for the case presented in "Example-1".
import re
import numpy as np
import pandas as pd

# Make the DataFrame from the whitespace-separated string in Example-1
s = s1
ncols = 3  # number of columns
ss = re.sub(r'\s+', ',', s.strip())  # collapse every whitespace run into a comma
sa = np.array(ss.split(',')).reshape(-1, ncols)
# The first row holds the headers; all values stay strings
df = pd.DataFrame(dict(zip(sa[0, :], sa[1:, :].T)))
df
Method-2
Use io.StringIO to feed the string into pandas.read_csv(). This only works if the separator is well defined, for instance if your data looks like "Example-2". Source credit
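import pandas as pd
from io import StringIO
# Make the DataFrame from the comma-separated string in Example-2
s = s2
# skipinitialspace drops the blank that follows each comma
df = pd.read_csv(StringIO(s), sep=',', skipinitialspace=True)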
Output
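For strings like Example-1, another option that may be worth trying is to keep pandas.read_csv() and pass a whitespace regex as the separator. This is a sketch under the assumption that columns are separated by runs of spaces or tabs and that no cell value itself contains a space; unlike Method-1, it also infers numeric dtypes.

```python
from io import StringIO

import pandas as pd

# Same dummy data as Example-1
s1 = """
Client NumberOfProducts ID
A 1 2
A 5 1
B 1 2
B 6 1
C 9 1
"""

# Any run of whitespace is treated as a single delimiter; blank lines are skipped
df1 = pd.read_csv(StringIO(s1), sep=r"\s+")
print(df1)
```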

Pandas: Add a scalar to multiple new columns in an existing dataframe [duplicate]

This question already has answers here:
How to add multiple columns to pandas dataframe in one assignment?
(13 answers)
Closed 4 years ago.
I recently answered a question where the OP was looking to add multiple columns with multiple different values to an existing dataframe (link). My answer is fairly succinct, but I don't think it is very fast.
Ultimately I was hoping I could do something like:
# Existing dataframe
df = pd.DataFrame({'a':[1,2]})
df[['b','c']] = 0
Which would result in:
a b c
1 0 0
2 0 0
But it throws an error.
Is there a super simple way to do this that I'm missing? Or is the answer I posted earlier the fastest / easiest way?
NOTE
I understand this could be done via loops, or via assigning scalars to multiple columns, but am trying to avoid that if possible. Assume 50 columns or whatever number you wouldn't want to write:
df['b'], df['c'], ..., df['xyz'] = 0, 0, ..., 0
Not a duplicate:
The "Possible duplicate" question suggested for this one shows multiple different values assigned to each column. I'm simply asking if there is a very easy way to assign a single scalar value to multiple new columns. The answer could correctly and very simply be "No", but that is worth knowing so I can stop searching.
Why not use assign?
df.assign(**dict.fromkeys(['b','c'],0))
Out[781]:
a b c
0 1 0 0
1 2 0 0
Or create the dict with d = dict(zip(namelist, valuelist)) and pass it the same way.
I think you want to do
df['b'], df['c'] = 0, 0
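For the 50-column case mentioned in the question, the assign approach can be sketched as follows. The column names `col_0` to `col_49` are made up for illustration. Keep in mind that assign returns a new DataFrame, so the result has to be bound to a name.

```python
import pandas as pd

# Existing dataframe
df = pd.DataFrame({'a': [1, 2]})

# Hypothetical names for 50 new columns
new_cols = [f'col_{i}' for i in range(50)]

# dict.fromkeys maps every name to the same scalar; assign broadcasts it to all rows
df = df.assign(**dict.fromkeys(new_cols, 0))
print(df.shape)
```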

Performance of Pandas string contains for column [duplicate]

This question already has answers here:
Pandas filtering for multiple substrings in series
(3 answers)
Closed 4 years ago.
I have a DataFrame of 83k rows with a text column "Text" that I have to search for ~200 masks. Is there a way to pass a column to .str.contains()?
I'm able to do it like this:
import time
start = time.time()
[a["Text"].str.contains(m).sum() for m in b["mask"].values]
print(time.time() - start)
But it's taking 34.013s. Is there any faster way?
Edit:
b["mask"] looks like:
'PR347856|P5478'
'BS7623|B5763'
and I want the count of occurrences for each mask, so I can't join them.
Edit:
a["Text"] contains strings about 3 sentences long
Maybe you can vectorize the containment operation.
text_contains = a['Text'].str.contains
b['mask'].map(lambda m: text_contains(m).sum())
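A small self-contained run of the snippet above, using made-up rows in `a` and the two masks from the question. Each mask is treated as a regular expression, and the result counts the rows that match a mask at least once, not the total number of matches within a row.

```python
import pandas as pd

# Made-up text rows standing in for the 83k-row frame
a = pd.DataFrame({"Text": ["order PR347856 shipped",
                           "no codes here",
                           "B5763 then BS7623",
                           "see P5478"]})
b = pd.DataFrame({"mask": ["PR347856|P5478", "BS7623|B5763"]})

# Bind the method once, then count matching rows for every mask
text_contains = a["Text"].str.contains
counts = b["mask"].map(lambda m: text_contains(m).sum())
print(counts)
```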
