extract value from pandas dataframe [duplicate] - python

This question already has answers here:
Extract int from string in Pandas
(8 answers)
Closed 1 year ago.
Below is the dataframe
import pandas as pd
import numpy as np
d = {'col1': ['Get URI||1621992600749||com.particlenews.newsbreak||https://graph.fb.com||2021-05-26 01:30:00||1.3.0-QA-1100||90',
'Get URI||1621992600799||com.particlenews.newsbreak||https://graph.fb.com||2021-05-26 01:30:00||1.3.0-QA-1100||90']}
df = pd.DataFrame(data=d)
and I need to extract the "1621992600749" and "1621992600799" values.
I have tried several approaches, including the split function:
new = df["col1"].str.split("||", n = 1, expand = True)
but it doesn't give the expected results. Any thoughts would be helpful.

You can use str.extract with a regex, which returns the first run of digits in each row:
df['col1'].str.extract(r'(\d+)')
#output
0
0 1621992600749
1 1621992600799
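As for why the split attempt in the question misbehaves: pandas most likely treats a multi-character pattern such as "||" as a regular expression, in which | means alternation between empty patterns. A minimal sketch, assuming that is the cause, escapes the pipes so they are matched literally. The column name timestamp is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col1": [
    "Get URI||1621992600749||com.particlenews.newsbreak||https://graph.fb.com||2021-05-26 01:30:00||1.3.0-QA-1100||90",
    "Get URI||1621992600799||com.particlenews.newsbreak||https://graph.fb.com||2021-05-26 01:30:00||1.3.0-QA-1100||90",
]})

# Escape the pipes so "||" is matched literally rather than as regex alternation.
parts = df["col1"].str.split(r"\|\|", expand=True)

# The second field (label 1) holds the value; "timestamp" is a hypothetical column name.
df["timestamp"] = parts[1].astype("int64")
print(df["timestamp"])
```

Unlike the `(\d+)` regex, this picks the value by field position, so it should keep working even if the first field ever contains digits.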

Related

How to extract a value after colon in all the rows from a pandas dataframe column? [duplicate]

This question already has answers here:
access value from dict stored in python df
(3 answers)
Closed 3 months ago.
Edit: the dummy dataframe is edited
I have a pandas data frame with the below kind of column with 200 rows.
Let's say the name of df is data.
-----------------------------------|
B
-----------------------------------|
{'animal':'cat', 'bird':'peacock'...}
I want to extract the value of animal to a separate column C for all the rows.
I tried the below code but it doesn't work.
data['C'] = data["B"].apply(lambda x: x.split(':')[-2] if ':' in x else x)
Please help.
The dictionary is unpacked with pd.json_normalize
import pandas as pd
data = pd.DataFrame({'B': [{0: {'animal': 'cat', 'bird': 'peacock'}}]})
data['C'] = pd.json_normalize(data['B'])['0.animal']
I'm not totally sure of the structure of your data. Does this look right?
import pandas as pd
import re
df = pd.DataFrame({
"B": ["'animal':'cat'", "'bird':'peacock'"]
})
df["C"] = df.B.apply(lambda x: re.sub(r".*?\:(.*$)", r"\1", x))
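Since the exact structure of column B is unclear from the question, here is a hedged sketch that handles two plausible cases: cells that are real dicts, and cells that are string representations of dicts. The sample rows and the helper name get_animal are assumptions for illustration, not taken from the asker's data.

```python
import ast
import pandas as pd

# Made-up sample: one cell is a dict, the other is a dict written out as a string.
data = pd.DataFrame({"B": [
    {"animal": "cat", "bird": "peacock"},
    "{'animal': 'dog', 'bird': 'crow'}",
]})

def get_animal(cell):
    """Return the 'animal' value from a dict or from a dict-like string."""
    if isinstance(cell, str):
        # literal_eval safely parses Python literals without running arbitrary code.
        cell = ast.literal_eval(cell)
    return cell.get("animal")

data["C"] = data["B"].apply(get_animal)
print(data["C"])
```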

One-liner to identify duplicates using pandas? [duplicate]

This question already has answers here:
How do I get a list of all the duplicate items using pandas in python?
(13 answers)
Closed 1 year ago.
While preparing for data analyst interview questions, I came across this one: "find all duplicate emails (not unique emails) in a one-liner using pandas."
The best I've got is not a single line but rather three:
# initialize dataframe
import pandas as pd
d = {'email':['a','b','c','a','b']}
df= pd.DataFrame(d)
# select emails having duplicate entries
results = pd.DataFrame(df.value_counts())
results.columns = ['count']
results[results['count'] > 1]
>>>
count
email
b 2
a 2
Could the second block following the latter comment be condensed into a one-liner, avoiding the temporary variable results?
Just use duplicated:
>>> df[df.duplicated()]
email
3 a
4 b
Or if you want a list:
>>> df[df["email"].duplicated()]["email"].tolist()
['a', 'b']
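If the counts from the question's output are wanted, or every occurrence of a duplicated email rather than only the repeats, the following sketch shows one way to get each in a single expression. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"email": ["a", "b", "c", "a", "b"]})

# Counts per email, filtered to those that occur more than once, in one expression.
dupe_counts = df["email"].value_counts().loc[lambda s: s > 1]

# keep=False marks every occurrence of a duplicated value, not only the later repeats.
all_dupe_rows = df[df["email"].duplicated(keep=False)]

print(dupe_counts)
print(all_dupe_rows)
```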

How to stop Pandas from rounding and changing numbers? [duplicate]

This question already has answers here:
Why is python pandas dataframe rounding my values?
(5 answers)
Closed 3 years ago.
I'm trying to load and extract data from a CSV with pandas and I'm noticing that it is changing the numbers loaded. How do I prevent this?
I've got a CSV, test.csv:
q,a,b,c,d,e,f
z,0.999211563,0.945548791,0.756781883,0.572315951,1.191243688,0.867855435
Here I load data:
df = pd.read_csv("test.csv")
print(df)
This outputs the following rounded figures:
q a b c d e f
0 z 0.999212 0.945549 0.756782 0.572316 1.191244 0.867855
What I ultimately want to do is access values by position:
print(df.iloc[0, [1, 2, 3, 4, 5, 6]].tolist())
But this is adding numbers to some of the figures.
[0.999211563, 0.9455487909999999, 0.7567818829999999, 0.572315951, 1.191243688, 0.867855435]
Pandas is altering my data. How can I stop pandas from rounding, and adding numbers to figures?
import pandas as pd
# raise the display precision while printing; the stored values are unchanged
with pd.option_context('display.precision', 10):
    df = pd.read_csv("test.csv", float_precision=None)
    print(df)
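Two separate effects seem to be at play in the question: the printed frame is rounded only for display, and the extra trailing digits come from how the text is converted to binary floating point. A minimal sketch, assuming the C parser is in use, asks read_csv for round-trip float conversion. The CSV is inlined through io.StringIO so the example is self-contained.

```python
import io
import pandas as pd

csv_text = (
    "q,a,b,c,d,e,f\n"
    "z,0.999211563,0.945548791,0.756781883,0.572315951,1.191243688,0.867855435\n"
)

# float_precision="round_trip" requests round-trip-exact float conversion from the parser.
df = pd.read_csv(io.StringIO(csv_text), float_precision="round_trip")

# Access the numeric values of the first row by position.
values = df.iloc[0, 1:].tolist()
print(values)

# Raising display.precision only changes how the frame is printed.
with pd.option_context("display.precision", 10):
    print(df)
```

If the values must match the text exactly, reading the columns as strings with dtype=str is another option, at the cost of losing numeric operations.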

How doing division each cell dataframe [duplicate]

This question already has answers here:
Pandas sum across columns and divide each cell from that value
(5 answers)
Closed 3 years ago.
I want to divide each cell by the sum of its row. There are actually many columns, not only A and B.
import pandas as pd
data = pd.DataFrame({'A':[1,2,3,1,2,3,1],
                     'B':[4,5,6,4,5,6,4]})
sum_row = data.sum(axis=1)
Here is an example of what I expect.
I think this should do the trick:
import pandas as pd
data = pd.DataFrame({'A':[1,2,3,1,2,3,1],
                     'B':[4,5,6,4,5,6,4]})
# remember the original columns before adding the helper column
value_cols = list(data.columns)
data['sum_row'] = data[value_cols].sum(axis=1)
for col in value_cols:
    # divide each original column (not only 'A') by the row sum
    data[col + ' / Sum_Row'] = data[col] / data['sum_row']
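A loop-free alternative, sketched here under the assumption that all columns are numeric, is DataFrame.div with axis=0, which divides every cell by the total of its row in one step. The name shares is illustrative.

```python
import pandas as pd

data = pd.DataFrame({"A": [1, 2, 3, 1, 2, 3, 1],
                     "B": [4, 5, 6, 4, 5, 6, 4]})

# Divide every cell by its row total; axis=0 aligns the row sums with the index.
shares = data.div(data.sum(axis=1), axis=0)
print(shares)
```

This scales to any number of columns without naming them, and each row of the result sums to 1.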

remove rows from dataframe where contents could be a choice of strings [duplicate]

This question already has answers here:
dropping rows from dataframe based on a "not in" condition [duplicate]
(2 answers)
Closed 4 years ago.
I can do something like:
data = df[ df['Proposal'] != 'C000' ]
to remove all Proposals with string C000, but how can I do something like:
data = df[ df['Proposal'] not in ['C000','C0001'] ]
to remove all proposals that match either C000 or C0001 (and so on)?
You can try this,
df = df.drop(df[df['Proposal'].isin(['C000','C0001'])].index)
Or to select the required ones,
df = df[~df['Proposal'].isin(['C000','C0001'])]
import numpy as np
data = df.loc[np.logical_not(df['Proposal'].isin({'C000','C0001'})), :]
# or
data = df.loc[ ~df['Proposal'].isin({'C000','C0001'}) , :]
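To see the isin approach end to end, here is a small self-contained sketch on made-up data, together with an equivalent DataFrame.query form that reads closer to the "not in" syntax from the question. The sample values and the name unwanted are assumptions.

```python
import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame({"Proposal": ["C000", "C0001", "C0002", "C000", "C0003"]})

unwanted = ["C000", "C0001"]

# Boolean mask: keep rows whose Proposal is NOT in the unwanted list.
data = df[~df["Proposal"].isin(unwanted)]

# Equivalent filter with query; @unwanted refers to the local Python variable.
data_q = df.query("Proposal not in @unwanted")

print(data)
```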
