Working with text data

APK · Aug-21-2020, 08:28 AM

Hey everyone,

I am working with text data which has the following format in excel:

Interview 1
Speaker
Date
Text

Interview 2
Speaker
Date
Text

Interview 3
Speaker
Date
Text

... and so forth (there are 400 such interviews).

Having read the xlsx file into Python I would now like to structure the text data as follows:

interview_number Speaker Date Text
1 A 01/02/1999
1 A 01/02/1999
1 A 01/02/1999
1 A 01/02/1999
2 B 05/08/2000
2 B 05/08/2000

I am afraid I’m struggling to come up with a Python code/loop that helps me split up the text into such columns. Could you please help me with this?

Thanks a lot!

**Larz60+** · Aug-21-2020, 01:34 PM

Looks like homework, make an attempt and we will be glad to help.
Hint: look at https://pypi.org/project/openpyxl/

APK · (This post was last modified: Aug-21-2020, 09:48 PM by Larz60+.)

Hey, thanks a lot for the support!
I know openpyxl, have been working with it before, it’s just that I have never worked with data in this format, i.e. all information in rows and only one single column… My idea was to extract those specific rows, interview_number, speaker, date, and then create new columns so that they are all split up. However, I haven’t even been successful at extracting those rows. What I’d ideally do is establish that all the rows follow the same format, interview number, speaker, date text, and then loop over interview numbers 1 to 400 to repeat the process. I have been unsuccessfully trying something along those lines:

for row in rows:
    sheet.append(row)
    
for row in sheet.iter_rows(min_row=1, min_col=1, max_row=6, max_col=3):
    for cell in row:
        print(cell.value, end=" ")
    print()    

book.save('iterbyrows.xlsx')

However, I don’t know how to tell Python which rows “are special” to me, i.e. the first 3 rows after an empty row, interview number, speaker, date. Shall I loop over the interview number? Or use the fact that between interviews there is an empty row?

If I looped trough the rows how could I tell Python which ones to extract?

file = "enter_path_to_file_here"
wb = openpyxl.load_workbook(file, read_only=True)
ws = wb.active

for row in ws.iter_rows("E"):
    for cell in row:
        if cell.value == "interview":
            print(ws.cell(row=cell.row, column=2).value)

I’m honestly at my wits’ end.

**Larz60+** · Aug-21-2020, 09:51 PM

There's a lot of example code here: https://openpyxl.readthedocs.io/en/stable/usage.html

**buran** · (This post was last modified: Aug-22-2020, 04:49 AM by buran.)

First of all, is it realy excel file (because you said text data)? Maybe it's a file that you just happen to view in Excel? If it is not xlsx file, using openpyxl may not be necessary.
In nay case the main issue is to parse the data you have. So the question is: Is the file structure consistent - always 4 lines, blank line, 4 lines of data, etc.

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Login Module Help - Comparing data in a text file to data held in a variable	KeziaKar	0	2,294	Mar-08-2018, 11:41 AM Last Post: KeziaKar

Working with text data

User Panel Messages

Announcements