Pandas: Merge two rows if row contains certain string

Pandas: Merge two rows if row contains certain string - python

I have a sheet that looks like this.
Fleet Risk Control
Communication
Interpersonal relationships
Demographic
Demographic
Q_21086
Q_21087
Q_21088
AGE
GENDER
1
3
4
27
Male
What I'm trying to achieve is where there is a row with 'Q_' inside of it, merge that cell with the top row and return a new dataframe.
So the existing data above would become something like this:
Fleet Risk Control - Q_21086
Communication - Q_21087
Interpersonal relationships - Q_21088
1
3
4
I honestly have no idea where to even begin with something like this.

You could try this one. This is for input:
import pandas as pd
df = pd.DataFrame({'Fleet Risk Control': ['Q_21086', 1],
'Communication': ['Q_21087', 3],
'Interpersonal relationships': ['Q_21088', 4],
'Demographic': ['AGE', 27],
'Demographic 2': ['Gender', 'Male']})
Now concat the header line with the first line of df:
df.columns = df.columns + ' - ' + df.iloc[0, :]
Extract every line without the first and dropping the last both columns
df = df.iloc[1:, :-2]

# rename columns
df.columns = [x + ' - ' + y if y.startswith('Q_') else x for x, y in zip(df.columns, df.iloc[0])]
#drop not matching columns
to_drop = [c for c, _ in df.iloc[0].apply(lambda x: not x.startswith('Q_')).items() if _]
df.drop(to_drop, axis=1)[1:]

Related

How to concatenate columns in CSV file using Python and Count the Total per UniqueID?

This question have been asked multiple times in this community but I couldn't find the correct answers since I am beginner in Python. I got 2 questions actually:
I want to concatenate 3 columns (A,B,C) with its value into 1 Column. Header would be ABC.
import os
import pandas as pd
directory = 'C:/Path'
ext = ('.csv')
for filename in os.listdir(directory):
f = os.path.join(directory, filename)
if f.endswith(ext):
head_tail = os.path.split(f)
head_tail1 = 'C:/Output'
k =head_tail[1]
r=k.split(".")[0]
p=head_tail1 + "/" + r + " - Revised.csv"
mydata = pd.read_csv(f)
new =mydata[["A","B","C","D"]]
new = new.rename(columns={'D': 'Total'})
new['Total'] = 1
new.to_csv(p ,index=False)
Once concatenated, is it possible to count the uniqueid and put the total in Column D? Basically, to get the total count per uniqueid (Column ABC),the data can be found on a link when you click that UniqueID. For ex: Column ABC - uniqueid1, -> click -> go to the next page, total of that uniqueid.
On the link page, you can get the total numbers of uniqueid by Serial ID
I have no idea how to do this, but I would really appreciate if someone can help me on this project and would learn a lot from this.
Thank you very much. God Bless
Searched in Google, Youtube and Stackoverflow, couldn't find the correct answer.

I'm not sure that I understand your question correctly. However, if you know exactly the column names (e.g., A, B, and C) that you want to concatenate you can do something like code below.
''.join(merge_columns) is to concatenate column names.
new[merge_columns].apply(lambda x: ''.join(x), axis=1) is to concatenate their values.
Then, you can count unique values of the new column using groupby().count().
new = mydata[["A","B","C","D"]]
new = new.rename(columns={'D': 'Total'})
new['Total'] = 1
# added lines
merge_columns = ['A', 'B', 'C']
merged_col = ''.join(merge_columns)
new[merged_col] = new[merge_columns].apply(lambda x: ''.join(x), axis=1)
new.drop(merge_columns, axis=1, inplace=True)
new = new.groupby(merged_col).count().reset_index()
new.to_csv(p ,index=False)
example:
# before
> new
A B C Total
0 a b c 1
1 x y z 1
2 a b c 1
# after execute added lines
> new
ABC Total
0 abc 2
1 xyz 1

Next time, try to specify your issues and give a minimal reproducible example.
This is just an example how to use pd.melt and pd.groupby.
I hope it helps with your question.
import pandas as pd
### example dataframe
df = pd.DataFrame([['first', 1, 2, 3], ['second', 4, 5, 6], ['third', 7, 8, 9]], columns=['ID', 'A', 'B', 'C'])
### directly sum up A, B and C
df['total'] = df.sum(axis=1, numeric_only=True)
print(df)
### how to create a so called long dataframe with melt
df_long = pd.melt(df, id_vars='ID', value_vars=['A', 'B', 'C'], var_name='ABC')
print(df_long)
### group long dataframe by column and sum up all values with this ID
df_group = df_long.groupby(by='ID').sum()
print(df_group)

Pandas read_csv for a no quote file

I'm trying to read a file that doesn't have any quotes, which is causing inconsistent number of row lengths
Data looks as follows:
col_a, col_b
abc, inc., 5
xyz corb, 10
Since there are no quotes around "abc, inc.", this is causing the first row to get split into 3 values, but it should actually be just 2 values.
This column is not necessarily in the first position, and that there can be another bad column like this. The data has around 250 columns.
I'm reading this using pd.read_csv, how can this be resolved?
Thanks!

Its not a CSV but since there is only one column with the errant commas you can process with the csv module and fix the slice that holds too many column values. When a row has too many cells, assume they are the ones from the unescaped comma.
import pandas as pd
import csv
def split_badrows(fileobj, bad_col, total_cols):
"""Iterate rows, colapsing extra columns at bad_col"""
for row in csv.reader(fileobj):
row = [cell.strip() for cell in row]
extras = len(row) - total_cols
if extras > 0:
# colapse slice at troubled column into single value
extras += 1 # python slice doesn't include right endpoint
row[bad_col] = ", ".join(row[bad_col:bad_col+extras])
del row[bad_col+1:bad_col+extras]
yield row
def df_from_badtext(fileobj, bad_col):
"""Make pandas.DataFrame from badly formatted text"""
columns = [cell.strip() for cell in next(fileobj).split(",")]
total_cols = len(columns)
return pd.DataFrame(split_badrows(fileobj, bad_col, total_cols),
columns=columns)
# test
open("testme.txt", "w").write("""col_a, col_b
abc, inc., 5
xyz corb, 10""")
df = df_from_badtext(open("testme.txt"), bad_col=0)
print(df)

Data split to list then transform to dataframe.
csv = '''col_a, col_b
abc, inc., 5
xyz corb, 10'''+'\n'
import re
import pandas as pd
reArr = re.findall('(.*),([^,]+)\n',csv)
df=pd.DataFrame(reArr[1:],columns=reArr[0])
print(df)
col_a
col_b
0
abc, inc.
5
1
xyz corb
10

EDIT:
Thanks to tdelaney comment below:
see if this works
pd.read_csv('foo.csv',delimiter=",(?!( [\w\d]*).,)").dropna(axis=1)
OLD:
using delimiter as ",(?!.*,)" in read_csv seems to be solving this for me

EDIT (after updated question with an additional column):
Solution 1:
You can create a function with the bad column as a parameter and use split and concat to correct the dataframe depending on that bad column. Please note that the bad_col parameter in my function is the column number, where we start counting at 1, rather than 0 (e.g. 1, 2, 3, etc. instead of 0, 1, 2, etc.):
import pandas as pd
import numpy as np
from io import StringIO
data = StringIO('''
col, col_a, col_b
000, abc, inc., 5
111, xyz corb, 10
''')
df = pd.read_csv(data, sep="|")
def fix_csv(df, bad_col):
cols = df.columns.str.split(', ')[0]
x = len(cols) - bad_col
tmp = df.iloc[:,0].str.split(', ', expand=True, n=x)
df = pd.concat([tmp.iloc[:,0],
tmp.iloc[:,-1].str.rsplit(', ', expand=True, n=x)],
axis=1)
df.columns = cols
return df
fix_csv(df, bad_col=2)
Solution 2 (this is if you have issues in multiple columns and you need to use more brute force):
It sounds like there is a possibility that you there could be multiple columns affected from the comments as you mentioned only 1 "so far".
As such, this might be a little bit of a project to clean up the data. The following code can give you an idea how to do that. The bottom-line is that you can create two different dataframes: 1) The first dataframe has the minimum number of commas (i.e. they should be the rows without any issues). 2) The other dataframe will be the dataframe with all of the issues. I've shown how you can clean the data to get to the correct number of columns and then change the data back and concat the two dataframes.
import pandas as pd
import numpy as np
from io import StringIO
data = StringIO('''
col, col_a, col_b
000, abc, inc., 5
111, xyz corb, 10
''')
df = pd.read_csv(data, sep="|")
cols = df.columns.str.split(', ')[0]
s = df.iloc[:,0].str.count(',')
df1 = df.copy()[s.eq(s.min())]
df1 = df1.iloc[:,0].str.split(', ', expand=True)
df1.columns = cols
df2 = df.copy()[s.gt(s.min())]
#inspect this dataframe manually to see how many rows affected, which columns, etc.
#cleanup df2 with some .replace so all equal commas
original = [', inc.', ', corp.']
temp = [' inc.', ' corp.']
df2.iloc[:,0] = df2.iloc[:,0].replace(original, temp, regex=True)
df2 = df2.iloc[:,0].str.split(', ', expand=True)
df2.columns = cols
#cleanup df2 by changing back to original values
df2['col_a'] = df2['col_a'].replace(temp, original, regex=True) # you can do this with other columns as well
df3 = pd.concat([df1, df2]).sort_index()
df3
Out[1]:
col col_a col_b
0 000 abc, inc. 5
1 111 xyz corb 10
Solution 3: Previous Solution (for original question when problem was only in first column - for reference)
You can read in with sep="|" as that | character is not in your .csv, so it reads all of the data into one column.
The main assumption to my solution is that the problematic column is only the first column. I use rsplit(', ') and limit the number of splits to the total number of columns minus 1 (with the example data, this is 2-1=1). Hopefully, this solves with your actual data or at least gives you some idea. If your data is separated by , instead of , , please note whether or not to adjust my splits as well.
import pandas as pd
import numpy as np
from io import StringIO
data = StringIO('''
col_a, col_b
abc, inc., 5
xyz corb, 10
''')
df = pd.read_csv(data, sep="|")
cols = df.columns.str.split(', ')[0]
x = len(cols) - 1
df = df.iloc[:,0].str.rsplit(', ', expand=True, n=x)
df.columns = cols
df
Out[1]:
col_a col_b
0 abc, inc. 5
1 xyz corb 10

Appending Multi-index column headers to existing dataframe

I'm looking to append a multi-index column headers to an existing dataframe, this is my current dataframe.
Name = pd.Series(['John','Paul','Sarah'])
Grades = pd.Series(['A','A','B'])
HumanGender = pd.Series(['M','M','F'])
DogName = pd.Series(['Rocko','Oreo','Cosmo'])
Breed = pd.Series(['Bulldog','Poodle','Golden Retriever'])
Age = pd.Series([2,5,4])
DogGender = pd.Series(['F','F','F'])
SchoolName = pd.Series(['NYU','UCLA','UCSD'])
Location = pd.Series(['New York','Los Angeles','San Diego'])
df = (pd.DataFrame({'Name':Name,'Grades':Grades,'HumanGender':HumanGender,'DogName':DogName,'Breed':Breed,
'Age':Age,'DogGender':DogGender,'SchoolName':SchoolName,'Location':Location}))
I want add 3 columns on top of the existing columns I already have. For example, columns [0,1,2,3] should be labeled 'People', columns [4,5,6] should be labeled 'Dogs', and columns [7,8] should be labeled 'Schools'. In the final result, it should be 3 columns on top of 9 columns.
Thanks!

IIUC, you can do:
newlevel = ['People']*4 + ['Dogs']*3 + ['Schools']*2
df.columns = pd.MultiIndex.from_tuples([*zip(newlevel, df.columns)])
Note [*zip(newlevel, df.columns)] is equivalent to
[(a,b) for a,b in zip(new_level, df.columns)]

How can I create string indexing instead of numbers from dataframe?

I want to create unique row identifiers , in place of the index column, from the contents present in the columns of a dataframe.
For example,
import pandas as pd
from pprint import pprint
df = pd.DataFrame(columns=["ID", "Animal", "Weight", "Description"])
df["ID"] = ["Qw9457", "gft878"]
df["Animal"] = ["Mouse", "Lion"]
df["Weight"] = [20, 67]
df["Description"] = ["hsdg rie", "gtre sjdhi"]
pprint(df)
Output:
ID Animal Weight Description
0 Qw9457 Mouse 20 hsdg rie
1 gft878 Lion 67 gtre sjdhi
I'd prefer to rename the index column using the contents present in the rest of the columns,
for example :
df.index = ["MQwrie", "Lgfgt"]
I would like to know if there are nice ways to programmatically generate
row identifiers(i.e index column) from the contents of columns.

If you are looking to generate an index based on bits of the data in each column, you can piece it together using Series operations and then assign the index. Below, we use the first letter of the animal's name, the weight, and the first word of the description as a new index.
import pandas as pd
df = pd.DataFrame({'ID': ['Qw9457', 'gft878'],
'Animal': ['Mouse', 'Lion'],
'Weight': [20, 67],
'Description': ['hsdg rie', 'gtre sjdhi']})
# create new index from data in df, assign as index
ix = df.Animal.str[0] + df.Weight.astype(str) + df.Description.str.split().str.get(0)
df_new = df.set_index(ix)
df_new
# returns:
ID Animal Weight Description
M20hsdg Qw9457 Mouse 20 hsdg rie
L67gtre gft878 Lion 67 gtre sjdhi
EDIT:
Yes, you add the current row number (starting at zero), you can use:
ix = (
df.Animal.str[0]
+ df.Weight.astype(str)
+ df.Description.str.split().str.get(0)
+ df.index.astype(str).str.zfill(3)
)
df_new = df.set_index(ix)
df_new
#returns:
ID Animal Weight Description
M20hsdg000 Qw9457 Mouse 20 hsdg rie
L67gtre001 gft878 Lion 67 gtre sjdhi

How Can I create three new columns for my data?

I've got some data that looks like
tweet_id worker_id option
397921751801147392 A1DZLZE63NE1ZI pro-vaccine
397921751801147392 A3UJO2A7THUZTV pro-vaccine
397921751801147392 A3G00Q5JV2BE5G pro-vaccine
558401694862942208 A1G94QON7A9K0N other
558401694862942208 ANMWPCK7TJMZ8 other
What I would like is a single line for each tweet id, and three 6 columns identifying the worker id and the option.
It the desired output is something like
tweet_id worker_id_1 option_1 worker_id_2 option_2 worker_id_3 option 3
397921751801147392 A1DZLZE63NE1ZI pro-vaccine A3UJO2A7THUZTV pro_vaccine A3G00Q5JV2BE5G pro_vaccine
How can I achieve this with pandas?

This is about reshaping data from long to wide format. You can create a grouped count column as id to spread as new column headers and then use pivot_table(), finally rename the columns by pasting the multi-level together.
df['count'] = df.groupby('tweet_id').cumcount() + 1
df1 = df.pivot_table(values = ['worker_id', 'option'], index = 'tweet_id',
columns = 'count', aggfunc='sum')
df1.columns = [x + "_" + str(y) for x, y in df1.columns]
An alternative option to pivot_table() is unstack():
df['count'] = df.groupby('tweet_id').cumcount() + 1
df1 = df.set_index(['tweet_id', 'count']).unstack(level = 1)
df1.columns = [x + "_" + str(y) for x, y in df1.columns]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas: Merge two rows if row contains certain string - python

# rename columns df.columns = [x + ' - ' + y if y.startswith('Q_') else x for x, y in zip(df.columns, df.iloc[0])] #drop not matching columns to_drop = [c for c, _ in df.iloc[0].apply(lambda x: not x.startswith('Q_')).items() if _] df.drop(to_drop, axis=1)[1:]

Related

How to concatenate columns in CSV file using Python and Count the Total per UniqueID?

Pandas read_csv for a no quote file

Appending Multi-index column headers to existing dataframe

How can I create string indexing instead of numbers from dataframe?

How Can I create three new columns for my data?

Categories

Resources