Convert text file into dataframe with custom multiple delimiter in python

I'm new to Python. I have a txt file that contains data like:
0: 480x640 2 persons, 1 cat, 1 clock, 1: 480x640 2 persons, 1 chair, Done. date (0.635s) Tue, 05 April 03:54:02
0: 480x640 3 persons, 1 cat, 1 laptop, 1 clock, 1: 480x640 4 persons, 2 chairs, Done. date (0.587s) Tue, 05 April 03:54:05
0: 480x640 3 persons, 1 chair, 1: 480x640 4 persons, 2 chairs, Done. date (0.582s) Tue, 05 April 03:54:07
I want to convert it into a pandas DataFrame using multiple delimiters.
I tried this code:
import pandas as pd

student_csv = pd.read_csv('output.txt', names=['a', 'b', 'date', 'status'],
                          sep='[0: 480x640, 1: 480x640 , date]')
student_csv.to_csv('txttocsv.csv', index=None)
Now how can I convert it into a pandas DataFrame like this?
a          b                c
2 persons  2 persons, Done  Tue, 05 April 03:54:02
How can I convert the text file into a DataFrame like that?

It's tricky to know exactly what your rules for splitting are. You can use a regex as the delimiter.
Here is a working example that splits the item lists and the date into columns, but you'll probably have to tweak it to your exact rules:
df = pd.read_csv('output.txt',
                 sep=r'(?:,\s*|^)(?:\d+: \d+x\d+|Done[^)]+\)\s*)',
                 header=None, engine='python',
                 names=(None, 'a', 'b', 'date')).iloc[:, 1:]
output:
a b date
0 2 persons, 1 cat, 1 clock 2 persons, 1 chair Tue, 05 April 03:54:02
1 3 persons, 1 cat, 1 laptop, 1 clock 4 persons, 2 chairs Tue, 05 April 03:54:05
2 3 persons, 1 chair 4 persons, 2 chairs Tue, 05 April 03:54:07
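To see how that delimiter regex carves up a row (and why the empty leading field is then dropped with .iloc[:, 1:]), here is a quick sanity check with re.split, which behaves like the python-engine sep here:
import re

sep = r'(?:,\s*|^)(?:\d+: \d+x\d+|Done[^)]+\)\s*)'
line = ('0: 480x640 2 persons, 1 cat, 1 clock, 1: 480x640 2 persons, '
        '1 chair, Done. date (0.635s) Tue, 05 April 03:54:02')
print(re.split(sep, line))
# ['', ' 2 persons, 1 cat, 1 clock', ' 2 persons, 1 chair', 'Tue, 05 April 03:54:02']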

You can use | in the sep argument to combine multiple delimiters:
df = pd.read_csv('data.txt', sep=r'0: 480x640|1: 480x640|date \(.*\)',
                 engine='python', names=('None', 'a', 'b', 'c')).drop('None', axis=1)
print(df)
print(df)
                                      a                           b                       c
0            2 persons, 1 cat, 1 clock,   2 persons, 1 chair, Done.  Tue, 05 April 03:54:02
1  3 persons, 1 cat, 1 laptop, 1 clock,  4 persons, 2 chairs, Done.  Tue, 05 April 03:54:05
2                   3 persons, 1 chair,  4 persons, 2 chairs, Done.  Tue, 05 April 03:54:07

Related

Replace values in pandas dataframe with blank space that start with a string value

I have a large pandas dataframe (15 million lines), and I want to replace any value that starts with 'College' with a blank. I know I could do this with a for loop or np.where, but that takes way too long on my large dataframe. I also want to create a 'combined id' column from the student name and the college, skipping rows that don't have a proper college name. What is the fastest way to do this?
original:
id1 id2 college_name student combined id
0 01 01 Stanford haley id/haley_Stanford
1 01 02 College12 josh id/josh_College12
2 01 03 Harvard jake id/jake_Harvard
2 01 05 UPenn emily id/emily_UPenn
2 01 00 College10 sarah id/sarah_College10
desired:
id1 id2 college_name student combined id
0 01 01 Stanford haley id/haley_Stanford
1 01 02 josh
2 01 03 Harvard jake id/jake_Harvard
2 01 05 UPenn emily id/emily_UPenn
2 01 00 sarah
Use boolean indexing:
m = df['college_name'].str.startswith('College')
df.loc[m, 'college_name'] = ''
df.loc[m, 'combined id'] = ''
Or if "combined id" does not exist, you have to use numpy.where:
df['combined id'] = np.where(m, '', 'id/'+df['student']+'_'+df['college_name'])
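Putting it together, a minimal runnable sketch (the sample frame is reconstructed from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'id1': ['01'] * 5,
                   'id2': ['01', '02', '03', '05', '00'],
                   'college_name': ['Stanford', 'College12', 'Harvard', 'UPenn', 'College10'],
                   'student': ['haley', 'josh', 'jake', 'emily', 'sarah']})

m = df['college_name'].str.startswith('College')
df.loc[m, 'college_name'] = ''
# build "combined id" only where a real college name is present
df['combined id'] = np.where(m, '', 'id/' + df['student'] + '_' + df['college_name'])
print(df)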
Here's a way to get from original to desired in your question:
df.loc[df.college_name.str.startswith("College"), ['college_name', 'combined_id']] = ''
Output:
id1 id2 college_name student combined_id
0 1 1 Stanford haley id/haley_Stanford
1 1 2 josh
2 1 3 Harvard jake id/jake_Harvard
2 1 5 UPenn emily id/emily_UPenn
2 1 0 sarah

Transform Hot Encoding

I have this data;
ID Month
001 June
001 July
001 August
002 July
I want the result to be like this:
ID June July August
001 1 1 1
002 0 1 0
I have tried one-hot encoding; my code is like this:
one_hot = pd.get_dummies(frame['Month'])
frame = frame.drop('Month', axis=1)
frame = frame.join(one_hot)
However, the result is like this
ID June July August
001 1 0 0
001 0 1 0
001 0 0 1
002 0 1 0
May I know which part of my query is wrong?
get_dummies returns strictly one-hot encoded values; you can use pd.crosstab instead:
>>> out = pd.crosstab(df.ID, df.Month)
>>> out
Month August July June
ID
1 1 1 1
2 0 1 0
To preserve the order of appearance of Months, you can reindex:
>>> out.reindex(df.Month.unique(), axis=1)
Month June July August
ID
1 1 1 1
2 0 1 0
In case an ID can have more than one row for the same month and you want to see it as 1,
out = out.ne(0).astype(int)
can be used afterwards.
If you need one-hot encoding, convert ID to the index and aggregate with max for an always-0/1 output:
one_hot = (pd.get_dummies(frame.set_index('ID')['Month'])
             .max(level=0)
             .reindex(frame.Month.unique(), axis=1))
print(one_hot)
June July August
ID
1 1 1 1
2 0 1 0
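Note that max(level=0) is deprecated in recent pandas; here is a runnable sketch of the same idea with groupby (sample data reconstructed from the question):
import pandas as pd

frame = pd.DataFrame({'ID': ['001', '001', '001', '002'],
                      'Month': ['June', 'July', 'August', 'July']})

one_hot = (pd.get_dummies(frame.set_index('ID')['Month'], dtype=int)
             .groupby(level=0).max()                     # collapse to one 0/1 row per ID
             .reindex(frame['Month'].unique(), axis=1))  # restore month order
print(one_hot)
#      June  July  August
# ID
# 001     1     1       1
# 002     0     1       0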

How to filter certain values in consecutive months?

I have a dataframe structured as follows:
Name Month Grade
Sue Jan D
Sue Feb D
Jason Mar B
Sue Mar D
Jason Jan B
Sue Apr A
Jason Feb C
I want to get the list of students who got a D in 3 consecutive months within the past 6 months. In the example above, Sue will be on the list since she got a D in Jan, Feb and Mar. How can I do that using Python, Pandas or Numpy?
I tried to solve your problem. I do have a solution for you but it may not be the fastest in terms of efficiency / code execution. Please see below:
newdf = df.pivot(index='Name', columns='Month', values='Grade')
newdf = newdf[['Jan', 'Feb', 'Mar', 'Apr']].fillna(-1)
newdf['concatenated'] = newdf['Jan'].astype('str') + newdf['Feb'].astype('str') + newdf['Mar'].astype('str') + newdf['Apr'].astype('str')
newdf[newdf['concatenated'].str.contains('DDD', regex=False, na=False)]
Output will be like:
Month Jan Feb Mar Apr concatenated
Name
Sue D D D A DDDA
If you just want the names, then use the following command instead:
newdf[newdf['concatenated'].str.contains('DDD', regex=False, na=False)].index.to_list()
I came up with this.
import numpy as np
import pandas as pd

df['Month_Nr'] = pd.to_datetime(df.Month, format='%b').dt.month
names = df.Name.unique()
students = np.array([])
for name in names:
    d_months = df[(df.Name == name) & (df.Grade == 'D')].sort_values('Month_Nr')
    # three consecutive D months means two successive month-to-month gaps of exactly 1
    if (d_months['Month_Nr'].diff().eq(1).astype(int).rolling(2).sum() == 2).any():
        students = np.append(students, name)
print(students)
Output:
['Sue']
You have a few ways to deal with this. The first is to use my previous solution, but that will require mapping academic numbers to months (i.e. September = 1, August = 12) so that you can apply math to work out consecutive values.
The alternative below is to convert Month into a datetime and work out the difference in months; we can then apply a cumulative count and filter any values greater than or equal to 3.
from io import StringIO

import numpy as np
import pandas as pd

d = StringIO("""Name Month Grade
Sue Jan D
Sue Feb D
Jason Mar B
Sue Dec D
Jason Jan B
Sue Apr A
Jason Feb C""")

df = pd.read_csv(d, sep=r'\s+')
df['date'] = pd.to_datetime(df['Month'], format='%b').dt.normalize()
# set any month after June back to the previous year
df['date'] = np.where(df['date'].dt.month > 6,
                      df['date'] - pd.DateOffset(years=1), df['date'])
df.sort_values(['Name', 'date'], inplace=True)

def month_diff(date):
    # 1 + running count of one-month gaps between successive dates in the group
    cumulative_months = (
        np.round(date.sub(date.shift(1)) / np.timedelta64(1, "M")).eq(1).cumsum()
    ) + 1
    return cumulative_months

df['count'] = df.groupby(["Name", "Grade"])["date"].apply(month_diff)
print(df.drop('date', axis=1))
Name Month Grade count
4 Jason Jan B 1
6 Jason Feb C 1
2 Jason Mar B 1
3 Sue Dec D 1
0 Sue Jan D 2
1 Sue Feb D 3
5 Sue Apr A 1
print(df.loc[df['Name'] == 'Sue'])
Name Month Grade date count
3 Sue Dec D 1899-12-01 1
0 Sue Jan D 1900-01-01 2
1 Sue Feb D 1900-02-01 3
5 Sue Apr A 1900-04-01 1

Python - Extract multiple values from string in pandas df

I've searched for an answer to the following question but haven't found one yet. I have a large dataset; here is a small example:
df =
A B
1 I bought 3 apples in 2013
3 I went to the store in 2020 and got milk
1 In 2015 and 2019 I went on holiday to Spain
2 When I was 17, in 2014 I got a new car
3 I got my present in 2018 and it broke down in 2019
What I would like is to extract all values greater than 1950 and have this as the end result:
A B C
1 I bought 3 apples in 2013 2013
3 I went to the store in 2020 and got milk 2020
1 In 2015 and 2019 I went on holiday to Spain 2015_2019
2 When I was 17, in 2014 I got a new car 2014
3 I got my present in 2018 and it broke down in 2019 2018_2019
I tried to extract the values first, but didn't get further than:
df["C"] = df["B"].str.extract('(\d+)').astype(int)
df["C"] = df["B"].apply(lambda x: re.search(r'\d+', x).group())
But all I get are error messages (I've only started Python and working with text a few weeks ago). Could someone help me?
With a single regex pattern (considering your comment that you "need the year it took place"):
In [268]: pat = re.compile(r'\b(19(?:[6-9]\d|5[1-9])|[2-9]\d{3})')
In [269]: df['C'] = df['B'].apply(lambda x: '_'.join(pat.findall(x)))
In [270]: df
Out[270]:
A B C
0 1 I bought 3 apples in 2013 2013
1 3 I went to the store in 2020 and got milk 2020
2 1 In 2015 and 2019 I went on holiday to Spain 2015_2019
3 2 When I was 17, in 2014 I got a new car 2014
4 3 I got my present in 2018 and it broke down in ... 2018_2019
Here's one way, using str.findall and joining those items from the resulting lists that are greater than 1950:
s = df["B"].str.findall('\d+')
df['C'] = s.apply(lambda x: '_'.join(i for i in x if int(i)> 1950))
A B C
0 1 I bought 3 apples in 2013 2013
1 3 I went to the store in 2020 and got milk 2020
2 1 In 2015 and 2019 I went on holiday to Spain 2015_2019
3 2 When I was 17, in 2014 I got a new car 2014
4 3 I got my present in 2018 and it broke down in ... 2018_2019
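The two answers combine naturally: applying the first answer's year pattern inside str.findall keeps everything vectorized (a sketch; the pattern matches years 1951-9999):
df['C'] = (df['B']
           .str.findall(r'\b(?:19(?:[6-9]\d|5[1-9])|[2-9]\d{3})\b')
           .str.join('_'))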

Searching multiple substrings in a column of strings and return substring category

I have two dataframes as follows:
df1 = pd.DataFrame({"id":["01", "02", "03", "04", "05", "06"],
"string":["This is a cat",
"That is a dog",
"Those are birds",
"These are bats",
"I drink coffee",
"I bought tea"]})
df2 = pd.DataFrame({"category":[1, 1, 2, 2, 3, 3],
"keywords":["cat", "dog", "birds", "bats", "coffee", "tea"]})
My dataframes look like this
df1:
id string
01 This is a cat
02 That is a dog
03 Those are birds
04 These are bats
05 I drink coffee
06 I bought tea
df2:
category keywords
1 cat
1 dog
2 birds
2 bats
3 coffee
3 tea
I would like to have an output column on df1 which is the category if at least one keyword from df2 is detected in the string, otherwise None. The expected output is the following:
id string category
01 This is a cat 1
02 That is a dog 1
03 Those are birds 2
04 These are bats 2
05 I drink coffee 3
06 I bought tea 3
I can think of looping over the keywords one by one and scanning through each string, but that is not efficient enough when the data gets big. May I have your suggestions on how to improve this? Thank you.
# Modified your data a bit.
df1 = pd.DataFrame({"id": ["01", "02", "03", "04", "05", "06", "07"],
                    "string": ["This is a cat",
                               "That is a dog",
                               "Those are birds",
                               "These are bats",
                               "I drink coffee",
                               "I bought tea",
                               "This won't match squat"]})
You can use a list comprehension involving next with the default argument.
df1['category'] = [
    next((c for c, k in df2.values if k in s), None) for s in df1['string']]
df1
id string category
0 01 This is a cat 1.0
1 02 That is a dog 1.0
2 03 Those are birds 2.0
3 04 These are bats 2.0
4 05 I drink coffee 3.0
5 06 I bought tea 3.0
6 07 This won't match squat NaN
You cannot avoid the O(N²) complexity, but this should be quite performant since it doesn't always have to iterate over every keyword in the inner loop (except in the worst case).
Note that this currently only supports substring matches (not regex-based matching, although with a little modification that can be done).
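For the regex-based matching mentioned above, one vectorized option is to build a single alternation from the keywords and map the extracted match back to its category (a sketch; assumes the keywords don't overlap):
import re

pattern = '|'.join(map(re.escape, df2['keywords']))   # 'cat|dog|birds|bats|coffee|tea'
mapping = dict(zip(df2['keywords'], df2['category']))
df1['category'] = (df1['string']
                   .str.extract(f'({pattern})', expand=False)
                   .map(mapping))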
Use a list comprehension with split and match against a dictionary created from df2 (note that in the output below the last string was apparently changed to "thea" to demonstrate the no-match case):
d = dict(zip(df2['keywords'], df2['category']))
df1['cat'] = [next((d[y] for y in x.split() if y in d), None) for x in df1['string']]
print(df1)
id string cat
0 01 This is a cat 1.0
1 02 That is a dog 1.0
2 03 Those are birds 2.0
3 04 These are bats 2.0
4 05 I drink coffee 3.0
5 06 I bought thea NaN
Another easy-to-understand solution, mapping df1['string'] through a small lookup function:
# create a dictionary with keyword -> category pairs
cats = dict(zip(df2.keywords, df2.category))

def categorize(s):
    for cat in cats.keys():
        if cat in s:
            return cats[cat]
    # return 0 in case nothing is found
    return 0

df1['category'] = df1['string'].map(categorize)
print(df1)
print(df1)
id string category
0 01 This is a cat 1
1 02 That is a dog 1
2 03 Those are birds 2
3 04 These are bats 2
4 05 I drink coffee 3
5 06 I bought tea 3
